[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

Matt Kangas (JIRA) Sat, 10 Sep 2005 22:30:32 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323158 ]


Matt Kangas commented on NUTCH-87:
----------------------------------

Sample plugin.xml file for use with WhitelistURLFilter:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="epile-whitelisturlfilter"
   name="Epile whitelist URL filter"
   version="1.0.0"
   provider-name="teamgigabyte.com">

   <extension-point
      id="org.apache.nutch.net.URLFilter"
      name="Nutch URL Filter"/>

   <runtime></runtime>

   <extension id="org.apache.nutch.net.urlfilter"
      name="Epile Whitelist URL Filter"
      point="org.apache.nutch.net.URLFilter">
             
      <implementation id="WhitelistURLFilter"
         class="epile.crawl.plugin.WhitelistURLFilter"/>                    
   </extension>
</plugin>
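
For readers without the attachment, here is a minimal sketch of the kind of class such a descriptor could point at. It is not the attached WhitelistURLFilter. The class name WhitelistSketch, the allow() helper, and the choice to key rules by host name are all assumptions made for illustration. The local URLFilter interface is a stand-in for org.apache.nutch.net.URLFilter (whose filter method returns the URL when accepted and null when rejected) so that the example compiles on its own. It also uses generics, which postdate the Java version Nutch targeted at the time.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Stand-in for org.apache.nutch.net.URLFilter so the sketch compiles alone. */
interface URLFilter {
    /** Returns the URL if it is accepted, or null if it is rejected. */
    String filter(String url);
}

/** Hypothetical whitelist filter: one hash lookup per URL, then one regex match. */
class WhitelistSketch implements URLFilter {
    // Assumed data layout: host name -> pattern that the full URL must match.
    private final Map<String, Pattern> rulesByHost = new HashMap<>();

    /** Registers a whitelisted host together with the regex its URLs must match. */
    void allow(String host, String regex) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT), Pattern.compile(regex));
    }

    @Override
    public String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // unparseable URL: reject
        }
        if (host == null) {
            return null; // no host component: reject
        }
        // Lookup cost does not grow with the number of whitelisted hosts.
        Pattern rule = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (rule == null) {
            return null; // host is not on the whitelist
        }
        return rule.matcher(url).matches() ? url : null;
    }
}
```

With a rule registered through allow("example.org", "^http://example\\.org/docs/.*"), a URL under http://example.org/docs/ is returned unchanged, while URLs on other hosts or outside that path yield null. Only the single pattern stored for the URL's host is evaluated, rather than every rule in a list.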

> Efficient site-specific crawling for a large number of sites
> ------------------------------------------------------------
>
>          Key: NUTCH-87
>          URL: http://issues.apache.org/jira/browse/NUTCH-87
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>  Environment: cross-platform
>     Reporter: AJ Chen
>  Attachments: JIRA-87-whitelistfilter.tar.gz
>
> There is a gap between whole-web crawling and crawling a single site (or a 
> handful of sites). Many applications fall into this gap: they need to crawl 
> a large number of selected sites, say 100000 domains. The current CrawlTool 
> is designed for a handful of sites. This request therefore calls for a new 
> feature in, or an improvement to, CrawlTool so that the "nutch crawl" command 
> can deal efficiently with a large number of sites. One requirement is to add 
> or change as little code as possible, so that the feature can be implemented 
> sooner rather than later.
>
> There is a discussion about adding a URLFilter to implement the requested 
> feature; see the following thread:
> http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html
>
> The idea is to use a hashtable in the URLFilter to look up the regex for any 
> given domain. A hashtable lookup will be much faster than the list 
> implementation currently used in RegexURLFilter. Fortunately, Matt Kangas 
> has already implemented this idea for his own application and is willing to 
> make it available for adaptation to Nutch. I'll be happy to help him in this 
> regard.
>
> Before we do that, we would like to hear more discussion of this approach or 
> of other approaches. In particular, let us know what the potential downsides 
> of a hashtable lookup in a new URLFilter plugin would be.
>
> AJ Chen

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
