[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323158 ]
Matt Kangas commented on NUTCH-87:
----------------------------------

Sample plugin.xml file for use with WhitelistURLFilter

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="epile-whitelisturlfilter"
   name="Epile whitelist URL filter"
   version="1.0.0"
   provider-name="teamgigabyte.com">

   <extension-point
      id="org.apache.nutch.net.URLFilter"
      name="Nutch URL Filter"/>

   <runtime></runtime>

   <extension id="org.apache.nutch.net.urlfiler"
              name="Epile Whitelist URL Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="WhitelistURLFilter"
                      class="epile.crawl.plugin.WhitelistURLFilter"/>
   </extension>
</plugin>

> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a
> handful of sites). Many applications actually fall into this gap: they
> need to crawl a large number of selected sites, say 100000 domains. The
> current CrawlTool is designed for a handful of sites. So, this request calls
> for a new feature or an improvement to CrawlTool so that the "nutch crawl"
> command can deal efficiently with a large number of sites. One requirement
> is to add or change the smallest amount of code, so that this feature can be
> implemented sooner rather than later.
> There is a discussion about adding a URLFilter to implement this requested
> feature; see the following thread -
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hashtable in the URLFilter to look up the regex for any
> given domain. A hashtable will be much faster than the list implementation
> currently used in RegexURLFilter.
> Fortunately, Matt Kangas has implemented this idea before for his own
> application and is willing to make it available for adaptation to Nutch.
> I'll be happy to help him in this regard.
> But before we do it, we would like to hear more discussion or comments
> about this approach or other approaches. In particular, let us know what
> the potential downsides of a hashtable lookup in a new URLFilter plugin
> would be.
> AJ Chen

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
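
For readers who want a concrete picture of the hashtable lookup proposed in the quoted description, here is a minimal sketch. It is not the code from the attached tarball and not the actual WhitelistURLFilter; the class name WhitelistSketch, the allow() method and the example hosts are made up for illustration. The only thing borrowed from Nutch is the filter() convention of returning the URL when it is accepted and null when it is rejected. The sketch is kept free of Nutch dependencies so it can be run on its own.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only: one regex per whitelisted host, stored in a
// hash map so that finding the rule for a URL does not require scanning a list.
final class WhitelistSketch {

    // host name (lower case) -> pattern that the URL path must match
    private final Map<String, Pattern> rulesByHost = new HashMap<>();

    // Registers a host and the regex its paths must match (hypothetical helper).
    void allow(String host, String pathRegex) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT), Pattern.compile(pathRegex));
    }

    // Returns the URL unchanged if it is accepted, or null if it is rejected.
    String filter(String url) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        String host = uri.getHost();
        if (host == null) {
            return null; // no host component, nothing to look up
        }
        // Single hash lookup instead of trying every rule in turn.
        Pattern rule = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (rule == null) {
            return null; // host is not on the whitelist
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return rule.matcher(path).matches() ? url : null;
    }

    public static void main(String[] args) {
        WhitelistSketch f = new WhitelistSketch();
        f.allow("example.org", "/docs/.*");
        System.out.println(f.filter("http://example.org/docs/a.html"));       // accepted
        System.out.println(f.filter("http://example.org/private/a.html"));    // null
        System.out.println(f.filter("http://other.example.com/docs/a.html")); // null
    }
}
```

With this layout the cost per URL is one map lookup plus one regex match, regardless of how many hosts are registered. The sketch matches on the exact host name; matching a whole domain including its subdomains would need additional lookups on parent domains and is left out here.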