[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: build.xml.patch-0.8

The previous patch file is valid for 0.7. Here is one that works for 0.8-dev 
(trunk).

(It consists of three separate one-line additions that include the plugin in 
the "deploy", "test", and "clean" targets.)

> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev, 0.7.2-dev
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
> build.xml.patch-0.8, urlfilter-whitelist.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a 
> handful of sites). Many applications fall into this gap: they typically need 
> to crawl a large number of selected sites, say 100000 domains. The current 
> CrawlTool is designed for a handful of sites. This request therefore calls 
> for a new feature or an improvement to CrawlTool so that the "nutch crawl" 
> command can deal efficiently with a large number of sites. One requirement is 
> to add or change the smallest possible amount of code, so that the feature 
> can be implemented sooner rather than later.
> There is a discussion about adding a URLFilter to implement the requested 
> feature; see the following thread: 
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hashtable in the URLFilter to look up the regex for any 
> given domain. A hashtable lookup will be much faster than the list 
> implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has 
> already implemented this idea for his own application and is willing to make 
> it available for adaptation to Nutch. I'll be happy to help him with this.
> But before we do it, we would like to hear more discussion or comments 
> on this approach or other approaches. In particular, let us know what the 
> potential downsides of a hashtable lookup in a new URLFilter plugin would be.
> AJ Chen
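
For illustration only, the hashtable idea in the description above could be 
sketched roughly as follows. This is not the code from the attached plugin: 
the class name DomainRegexLookup and the methods allow() and filter() are 
hypothetical, the convention of returning null to reject a URL is an 
assumption made for this sketch, and the code uses modern Java rather than 
the Java of the 0.7/0.8 era.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of a per-host regex lookup; not the actual Nutch plugin. */
class DomainRegexLookup {
    // Maps a lower-cased host name to the patterns allowed for that host.
    private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

    /** Registers a regex that URLs on the given host are allowed to match. */
    void allow(String host, String regex) {
        rulesByHost
            .computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new ArrayList<>())
            .add(Pattern.compile(regex));
    }

    /** Returns the URL if it is accepted, or null if it is rejected (assumed convention). */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: reject
        }
        if (host == null) {
            return null; // no host component: reject
        }
        // One hash lookup selects the few patterns for this host,
        // instead of scanning one long list of patterns for all sites.
        List<Pattern> patterns = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (patterns == null) {
            return null; // host is not on the whitelist
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null;
    }
}
```

With a single flat list, every URL is tested against the patterns of every 
site; here the cost per URL depends only on the number of patterns registered 
for its own host. One downside visible even in this sketch is that the lookup 
is by exact host name, so subdomains would need their own entries or some 
extra normalization.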
