[ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367731 ]
Fuad Efendi commented on NUTCH-100:
-----------------------------------

Sorry, should be [&&] instead of [||] in previous comment

> New plugin urlfilter-db
> -----------------------
>
>          Key: NUTCH-100
>          URL: http://issues.apache.org/jira/browse/NUTCH-100
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: All Nutch versions
>     Reporter: Gal Nitzan
>     Priority: Trivial
>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>
> Hi,
> I have written a new plugin, based on the URLFilter interface: urlfilter-db.
> The purpose of this plugin is to filter domains, i.e. I would like to crawl
> the world but to fetch only certain domains.
> The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and
> on the back-end a database.
> For each url
>    filter is called
> end for
> filter
>    get the domain name from url
>    call cache.get domain
>    if not in cache try the database
>    if in database cache it and return it
>    return null
> end filter
> The plugin reads the cache size, jdbc driver, connection string, table to use
> and domain field from nutch-site.xml

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
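For readers following the quoted pseudocode, here is a minimal, self-contained Java sketch of the cache-then-database lookup it describes. It is not the code from the attached patch. The class name DomainFilterSketch is hypothetical, an access-ordered LinkedHashMap stands in for SwarmCache, and a Predicate<String> stands in for the JDBC query. It also assumes that "the domain name" means the URL's host, and that the filter returns the URL when the domain is allowed and null otherwise.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of the domain filter described in the issue. */
class DomainFilterSketch {
    // Small LRU cache of allowed hosts (stand-in for SwarmCache).
    private final Map<String, Boolean> cache;
    // Stand-in for the database lookup (assumption: true means "domain is listed").
    private final Predicate<String> database;

    DomainFilterSketch(int cacheSize, Predicate<String> database) {
        this.database = database;
        // accessOrder=true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the least recently used host.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its host is allowed, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);

        // 1. Try the cache first.
        if (cache.get(host) != null) {
            return url;
        }
        // 2. Not cached: try the database, and cache the host on a hit.
        if (database.test(host)) {
            cache.put(host, Boolean.TRUE);
            return url;
        }
        // 3. Unknown domain: reject.
        return null;
    }
}
```

As in the pseudocode, only hits are cached, so every URL from a rejected domain still reaches the database. Caching negative results as well would be a separate design choice.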
