[Nutch-dev] [jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Matt Kangas (JIRA) Thu, 12 Jan 2006 17:53:04 -0800

     [ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]


Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: build.xml.patch
                urlfilter-whitelist.tar.gz

THIS REPLACES THE PREVIOUS TARBALL
SEE THE INCLUDED README.txt FOR USAGE GUIDELINES

Place both of these files into ~nutch/src/plugin, then:
- untar the tarball
- apply the patch to ~nutch/src/plugin/build.xml to permit urlfilter-whitelist 
to be built

Next, cd ~nutch and build ("ant").

A JUnit test is included. It will be run automatically by "ant test-plugins".

Then follow the instructions in ~nutch/src/plugin/urlfilter-whitelist/README.txt

> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, 
> urlfilter-whitelist.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a 
> handful of sites). Many applications fall in this gap: they need to crawl 
> a large number of selected sites, say 100000 domains. The current 
> CrawlTool is designed for a handful of sites. So, this request calls for a 
> new feature or an improvement to CrawlTool so that the "nutch crawl" command 
> can efficiently deal with a large number of sites. One requirement is to add 
> or change the smallest amount of code possible, so that this feature can be 
> implemented sooner rather than later. 
> There is a discussion about adding a URLFilter to implement the requested 
> feature; see the following thread: 
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
> The idea is to use a hashtable in the URLFilter to look up the regex for any 
> given domain. A hashtable lookup should be much faster than the list 
> implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has 
> implemented this idea before for his own application and is willing to make 
> it available for adaptation to Nutch. I'll be happy to help him in this 
> regard. 
> But before we do it, we would like to hear more discussion or comments 
> about this approach or other approaches. In particular, let us know what 
> the potential downsides of a hashtable lookup in a new URLFilter plugin 
> would be.
> AJ Chen
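
For readers who want a concrete picture of the hashtable idea described in
the issue, here is a minimal sketch. It is NOT the code in
urlfilter-whitelist.tar.gz and does not use the Nutch plugin API; the class
name HostWhitelistSketch and the methods allow() and filter() are
hypothetical names chosen for illustration. It assumes rules are keyed by
exact host name and that a rejected URL is signalled by returning null.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration: per-host regex rules stored in a hash map, so a
// URL is only tested against the patterns registered for its own host.
class HostWhitelistSketch {
    // host name (lower-cased) -> compiled patterns allowed for that host
    private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

    // Register one allowed URL pattern for a host.
    void allow(String host, String regex) {
        rulesByHost
            .computeIfAbsent(host.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
            .add(Pattern.compile(regex));
    }

    // Returns the URL unchanged if it is accepted, or null if it is rejected.
    String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparsable URLs are rejected
        }
        if (host == null) {
            return null; // no host component, nothing to look up
        }
        // Single hash lookup instead of scanning every rule for every site.
        List<Pattern> patterns = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (patterns == null) {
            return null; // host is not on the whitelist
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null; // host is known, but no pattern matched
    }
}
```

With this layout the cost of filtering one URL depends on the number of
patterns registered for its host, not on the total number of whitelisted
sites, which is the property the issue is after for crawls of around 100000
domains. A real filter would also need to decide how subdomains and
non-HTTP schemes are handled; the sketch leaves both out.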

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



