Hi,

How would performance be affected if the domain list grows to 10,000 entries?
Micheal Ji

--- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:

> [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
>
> Gal Nitzan updated NUTCH-100:
> -----------------------------
>
>            type: Improvement  (was: New Feature)
>     Description:
> Hi,
>
> I have written a new plugin, based on the URLFilter
> interface: urlfilter-db .
>
> The purpose of this plugin is to filter domains,
> i.e. I would like to crawl the world but to fetch
> only certain domains.
>
> The plugin uses a caching system (SwarmCache, easier
> to deploy than JCS) and on the back-end a database.
>
> For each url
>    filter is called
> end for
>
> filter
>    get the domain name from url
>    call cache.get domain
>    if not in cache try the database
>    if in database cache it and return it
>    return null
> end filter
>
> The plugin reads the cache size, jdbc driver,
> connection string, table to use and domain field
> from nutch-site.xml
>
>   was:
> Hi,
>
> I have written (not much) a new plugin, based on the
> URLFilter interface: urlfilter-db .
>
> The purpose of this plugin is to filter domains,
> i.e. I would like to crawl the world but to fetch
> only certain domains.
>
> The plugin uses a caching system (SwarmCache, easier
> to deploy than JCS) and on the back-end a database.
>
> For each url
>    filter is called
> end for
>
> filter
>    get the domain name from url
>    call cache.get domain
>    if not in cache try the database
>    if in database cache it and return it
>    return null
> end filter
>
> The plugin reads the cache size, jdbc driver,
> connection string, table to use and domain field
> from nutch-site.xml
>
>     Environment: All Nutch versions  (was: MapRed)
>
> Fixed some issues
> clean up
> Added a patch for Subversion
>
> > New plugin urlfilter-db
> > -----------------------
> >
> >          Key: NUTCH-100
> >          URL: http://issues.apache.org/jira/browse/NUTCH-100
> >      Project: Nutch
> >         Type: Improvement
> >   Components: fetcher
> >     Versions: 0.8-dev
> >  Environment: All Nutch versions
> >     Reporter: Gal Nitzan
> >     Priority: Trivial
> >  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
> >
> > Hi,
> > I have written a new plugin, based on the
> > URLFilter interface: urlfilter-db .
> > The purpose of this plugin is to filter domains,
> > i.e. I would like to crawl the world but to fetch
> > only certain domains.
> > The plugin uses a caching system (SwarmCache,
> > easier to deploy than JCS) and on the back-end a
> > database.
> > For each url
> >    filter is called
> > end for
> > filter
> >    get the domain name from url
> >    call cache.get domain
> >    if not in cache try the database
> >    if in database cache it and return it
> >    return null
> > end filter
> > The plugin reads the cache size, jdbc driver,
> > connection string, table to use and domain field
> > from nutch-site.xml
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of
> the administrators:
>    http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>    http://www.atlassian.com/software/jira
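
For reference, the filter pseudocode quoted above could be sketched in Java roughly as follows. This is only a minimal sketch and not the plugin's actual code. The class name DomainFilterSketch is made up, the back-end database is replaced by a hypothetical Predicate<String> standing in for the JDBC lookup, and a small LRU map built on LinkedHashMap takes the place of SwarmCache. As in the pseudocode, only domains found in the back end are cached, and a rejected URL yields null.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Sketch of the quoted filter: check the cache first, then the back-end store. */
class DomainFilterSketch {
    // LRU cache of domains already confirmed by the back end (stand-in for SwarmCache).
    private final Map<String, Boolean> cache;
    // Hypothetical stand-in for the database lookup: true if the domain is listed.
    private final Predicate<String> backend;

    DomainFilterSketch(final int cacheSize, Predicate<String> backend) {
        this.backend = backend;
        // Access-ordered map that evicts the least recently used entry when full.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL unchanged if its domain is listed, otherwise null. */
    synchronized String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();   // get the domain name from url
        } catch (IllegalArgumentException e) {
            return null;                        // unparsable URL: reject
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (cache.containsKey(host)) {          // call cache.get domain
            cache.get(host);                    // touch the entry to refresh LRU order
            return url;
        }
        if (backend.test(host)) {               // if not in cache try the database
            cache.put(host, Boolean.TRUE);      // if in database cache it and return it
            return url;
        }
        return null;                            // not listed anywhere
    }
}
```

Read this way, a listed domain costs one back-end lookup until it is evicted from the cache, while a domain that is not listed goes to the back end on every call, because the pseudocode does not cache misses. Whether that matters with 10,000 domains would depend on the configured cache size and on how many of the crawled URLs fall outside the list.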
