[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]
Matt Kangas updated NUTCH-87:
-----------------------------
Attachment: build.xml.patch-0.8
The previous patch file is valid for 0.7. Here is one that works for 0.8-dev
(trunk).
(It consists of three separate one-line additions that include the plugin in
the "deploy", "test", and "clean" targets.)
> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
> Key: NUTCH-87
> URL: http://issues.apache.org/jira/browse/NUTCH-87
> Project: Nutch
> Type: New Feature
> Components: fetcher
> Versions: 0.8-dev, 0.7.2-dev
> Environment: cross-platform
> Reporter: AJ Chen
> Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch,
> build.xml.patch-0.8, urlfilter-whitelist.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a
> handful of sites). Many applications fall into this gap: they need to crawl a
> large number of selected sites, say 100000 domains. The current CrawlTool is
> designed for a handful of sites. This request therefore calls for a new
> feature in, or an improvement to, CrawlTool so that the "nutch crawl" command
> can efficiently deal with a large number of sites. One requirement is to add
> or change as little code as possible, so that the feature can be implemented
> sooner rather than later.
> There is a discussion about adding a URLFilter to implement the requested
> feature; see the following thread:
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hashtable in the URLFilter to look up the regexes for
> any given domain. A hashtable lookup will be much faster than the list
> implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has
> already implemented this idea for his own application and is willing to make
> it available for adaptation to Nutch. I'll be happy to help him in this regard.
> Before we do it, though, we would like to hear more discussion or comments
> about this approach or other approaches. In particular, let us know what the
> potential downsides of a hashtable lookup in a new URLFilter plugin would be.
> AJ Chen
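
For readers following the proposal, the hashtable idea described above might look roughly like the sketch below. It is not the code from the attached plugin. The class name DomainHashFilter and the allow() method are hypothetical, and the return convention (the URL if accepted, null if rejected) is assumed to mirror what a Nutch URLFilter does. Only the JDK is used.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: per-host regex rules stored in a hash map. */
public class DomainHashFilter {

    // Host name -> regexes that apply only to that host.
    private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

    /** Registers a regex that accepts matching URLs on the given host. */
    public void allow(String host, String regex) {
        rulesByHost
            .computeIfAbsent(host.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
            .add(Pattern.compile(regex));
    }

    /** Returns the URL if a rule for its host matches, otherwise null. */
    public String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        if (host == null) {
            return null;
        }
        // One hash lookup selects the few patterns for this host, instead of
        // scanning a single list holding every pattern for every site.
        List<Pattern> patterns = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (patterns == null) {
            return null; // host is not whitelisted
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null;
    }
}
```

With 100000 whitelisted domains, each URL is checked against only the patterns registered for its own host, so the per-URL cost no longer grows with the total number of sites. One downside visible even in this sketch is that matching is by exact host name, so subdomains need their own entries or extra lookup logic.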