[ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367731 ]
Fuad Efendi commented on NUTCH-100:
-----------------------------------

Sorry, should be [&&] instead of [||] in previous comment

> New plugin urlfilter-db
> -----------------------
>
>          Key: NUTCH-100
>          URL: http://issues.apache.org/jira/browse/NUTCH-100
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: All Nutch versions
>     Reporter: Gal Nitzan
>     Priority: Trivial
>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>
> Hi,
> I have written a new plugin, based on the URLFilter interface: urlfilter-db.
> The purpose of this plugin is to filter domains, i.e. I would like to crawl
> the world but to fetch only certain domains.
> The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and
> on the back-end a database.
> For each url
>   filter is called
> end for
> filter
>   get the domain name from url
>   call cache.get domain
>   if not in cache try the database
>   if in database cache it and return it
>   return null
> end filter
> The plugin reads the cache size, jdbc driver, connection string, table to use
> and domain field from nutch-site.xml
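
For readers following the pseudocode in the issue description, here is a minimal Java sketch of the cache-then-database lookup it outlines. It is not the code from the attached patch. The class name DomainDbFilterSketch, the Predicate standing in for the JDBC query, and the LinkedHashMap-based LRU cache standing in for SwarmCache are all assumptions made for illustration. It also treats the URL host as the "domain name", which may differ from what the plugin does.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the described flow; not the plugin's actual code.
class DomainDbFilterSketch {
    // Access-ordered LinkedHashMap used as a small LRU cache
    // (assumed stand-in for SwarmCache).
    private final Map<String, Boolean> cache;
    // Assumed stand-in for the database lookup: true if the domain is listed.
    private final Predicate<String> database;

    DomainDbFilterSketch(final int cacheSize, Predicate<String> database) {
        this.database = database;
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Evict the least recently used entry once the cache is full.
                return size() > cacheSize;
            }
        };
    }

    // Returns the URL if its domain is in the cache or the database,
    // otherwise null (the URL is rejected).
    String filter(String url) {
        String domain = domainOf(url);
        if (domain == null) {
            return null;
        }
        if (cache.get(domain) != null) {
            return url; // cache hit
        }
        if (database.test(domain)) {
            cache.put(domain, Boolean.TRUE); // only known domains are cached
            return url;
        }
        return null;
    }

    // Assumption: the "domain name" is the lower-cased host part of the URL.
    private static String domainOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

As in the pseudocode, only domains found in the database are cached, so every URL from an unlisted domain causes a database lookup. Caching negative results as well would reduce that load, at the cost of delaying the effect of newly added domains.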
