[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Matt Kangas (JIRA) Thu, 19 Jan 2006 17:34:03 -0800

     [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]


Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: build.xml.patch-0.8

The previous patch file is valid for 0.7. Here is one that works for 0.8-dev 
(trunk).

(It's three separate one-line additions, to include the plugin in the "deploy", 
"test", and "clean" targets.)

> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev, 0.7.2-dev
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
> build.xml.patch-0.8, urlfilter-whitelist.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a 
> handful of sites). Many applications fall in this gap: they usually need to 
> crawl a large number of selected sites, say 100,000 domains. The current 
> CrawlTool is designed for a handful of sites. So this request calls for a 
> new feature in, or an improvement to, CrawlTool so that the "nutch crawl" 
> command can deal efficiently with a large number of sites. One requirement 
> is to add or change as little code as possible so that the feature can be 
> implemented sooner rather than later. 
> There has been a discussion about adding a URLFilter to implement the 
> requested feature; see the following thread: 
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hashtable in the URLFilter to look up the regex for any 
> given domain. A hashtable lookup should be much faster than the list 
> implementation currently used in RegexURLFilter. Fortunately, Matt Kangas 
> has already implemented this idea for his own application and is willing to 
> make it available for adaptation to Nutch. I'll be happy to help him in this 
> regard. 
> But before we do it, we would like to hear more discussion of, or comments 
> on, this approach or other approaches. In particular, let us know what the 
> potential downsides of hashtable lookup in a new URLFilter plugin would be.
> AJ Chen
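
For readers following the thread, here is a minimal sketch of the hashtable 
idea described in the quoted issue: key the rules by host so that each URL is 
checked against only the patterns registered for its own host, instead of 
scanning one long regex list. This is not the code in the attached tarballs. 
The class name HostWhitelistSketch, the allow() method, and the example.com 
rule are hypothetical, and returning the URL on accept and null on reject is 
an assumption about the convention a Nutch URLFilter would follow.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch only; not the attached whitelist plugin. */
class HostWhitelistSketch {
    // Lower-cased host name -> patterns that URLs on that host may match.
    private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

    /** Registers one regex for the given host. */
    void allow(String host, String regex) {
        rulesByHost
            .computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new ArrayList<>())
            .add(Pattern.compile(regex));
    }

    /** Returns the URL if it is accepted, or null if it is rejected (assumed convention). */
    String filter(String urlString) {
        String host;
        try {
            host = new URI(urlString).getHost();
        } catch (URISyntaxException e) {
            return null; // unparseable URLs are rejected
        }
        if (host == null) {
            return null; // no host component, nothing to look up
        }
        // One hash lookup selects the small per-host rule list.
        List<Pattern> patterns = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (patterns == null) {
            return null; // host is not whitelisted
        }
        for (Pattern p : patterns) {
            if (p.matcher(urlString).find()) {
                return urlString;
            }
        }
        return null;
    }
}
```

With a rule registered via allow("example.com", "^http://example\\.com/docs/"), 
a URL under http://example.com/docs/ is returned unchanged, while URLs on 
other hosts or outside /docs/ yield null. One limitation of this sketch is 
that the lookup is an exact host match, so subdomains such as 
www.example.com would each need their own entry.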

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
