[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]
Matt Kangas updated NUTCH-87:
-----------------------------

    Version: 0.7.2-dev
             0.8-dev

> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev, 0.7.2-dev
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch,
>               urlfilter-whitelist.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a
> handful of sites). Many applications fall into this gap: they typically need
> to crawl a large number of selected sites, say 100,000 domains. The current
> CrawlTool is designed for a handful of sites. This request therefore calls
> for a new feature, or an improvement to CrawlTool, so that the "nutch crawl"
> command can deal efficiently with a large number of sites. One requirement is
> to add or change the smallest possible amount of code, so that the feature
> can be implemented sooner rather than later.
>
> There has been a discussion about adding a URLFilter to implement this
> feature; see the following thread:
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
>
> The idea is to use a hashtable in the URLFilter to look up the regex for any
> given domain. A hashtable lookup should be much faster than the list
> implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has
> already implemented this idea for his own application and is willing to make
> it available for adaptation to Nutch. I'll be happy to help him with that.
>
> Before we do so, we would like to hear more discussion and comments on this
> approach, or on other approaches. In particular, let us know what the
> potential downsides of a hashtable lookup in a new URLFilter plugin would be.
>
> AJ Chen

--
This message is automatically generated by JIRA.
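
For readers following the issue, the hashtable idea in the description might look roughly like the sketch below. It is an illustration only, not the code in the attached tarballs: the class name `DomainWhitelistFilter`, the `addRule` method, and the choice to key on the exact host name (rather than a registered domain or a suffix) are all assumptions made for this example. The `filter` method returns the URL when it is accepted and `null` when it is rejected, which is assumed here to match the convention a Nutch URL filter would follow.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of a per-host whitelist; not the attached implementation. */
public class DomainWhitelistFilter {
    // Maps a lower-cased host name to the regexes that apply to that host only.
    private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

    /** Registers a regex for one host (assumed helper, not a Nutch API). */
    public void addRule(String host, String regex) {
        rulesByHost
            .computeIfAbsent(host.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
            .add(Pattern.compile(regex));
    }

    /** Returns the URL if it is accepted, or null if it is rejected. */
    public String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        if (host == null) {
            return null; // no host component, nothing to look up
        }
        // One hash lookup selects the few patterns relevant to this host,
        // instead of scanning a single list of patterns for every site.
        List<Pattern> patterns = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (patterns == null) {
            return null; // host is not on the whitelist
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null;
    }
}
```

With this layout the cost per URL depends on the number of patterns registered for that one host, not on the total number of whitelisted sites. One downside worth discussing is that an exact-host key does not cover subdomains unless each is registered separately.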