[Nutch-general] Re: removing site from webdb

Rod Taylor Fri, 17 Mar 2006 12:17:02 -0800

On Fri, 2006-03-17 at 13:44 -0500, Insurance Squared Inc. wrote:
> We've got a site that is causing our crawl to slow dramatically, from 
> 20mbits down to about 3 or 4.  The basic problem is that the site seems 
> to consist of huge numbers of pages that aren't responding.  We can 
> remove the site from the index, but it seems like a problem to remove 
> this site permanently from the webdb so that we never fetch it again.  
> Is there an easy way in 0.71 to remove a site from the webdb, and then 
> keep it permanently removed?


You can add a filter on that domain to your regex-urlfilter.txt file, or
you can allow Nutch to churn through each URL and mark it as invalid
individually.
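
For the filter route, here is a rough sketch of how such a rule behaves. It
assumes the usual regex-urlfilter.txt convention: each line is '+' (accept)
or '-' (reject) followed by a regular expression, the first matching line
decides, and a URL that matches nothing is dropped. The host
badsite.example.com and the class UrlRuleSketch are made up for
illustration; this is not Nutch's own filter class, so check the rule
against your 0.71 install before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: mimics first-match-wins rule evaluation; not Nutch's filter class.
class UrlRuleSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each rule line is '+' (accept) or '-' (reject) followed by a regex.
    // Blank lines and lines starting with '#' are skipped.
    UrlRuleSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // The first rule whose regex is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlRuleSketch filter = new UrlRuleSketch(
            "# hypothetical host standing in for the site to drop",
            "-^http://([a-z0-9-]+\\.)*badsite\\.example\\.com/",
            "# accept everything else",
            "+.");
        System.out.println(filter.accepts("http://www.badsite.example.com/page1"));
        System.out.println(filter.accepts("http://good.example.org/"));
    }
}
```

In the file itself the reject line would be written with single backslashes
(the doubling above is only Java string escaping), and it has to sit above
any catch-all accept line, since the first match wins.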

This process can be done quite quickly if Nutch scales the number of
threads to achieve the best use of bandwidth.

Encourage the Nutch folks to apply this patch. I give it 50Mbits and
Nutch will scale up to 500 threads per task if most threads are hitting
bad pages or down to about 60 threads per task if they're downloading
large pages. In the end we stay within about 10% of the 50Mbit target.

http://issues.apache.org/jira/browse/NUTCH-207
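
To give a feel for the idea (this is not the code from the patch), here is a
minimal sketch of scaling the thread count toward a bandwidth target. It
assumes a simple proportional rule: estimate the throughput per thread from
the last measurement, then pick the thread count that would hit the target,
clamped to a floor and a ceiling. The class name, the method, and the
10/500 bounds are all hypothetical.

```java
// Sketch only: a proportional controller, not the NUTCH-207 implementation.
class ThreadScalerSketch {
    private final double targetMbits; // bandwidth target, e.g. 50
    private final int minThreads;     // hypothetical lower bound
    private final int maxThreads;     // hypothetical upper bound

    ThreadScalerSketch(double targetMbits, int minThreads, int maxThreads) {
        this.targetMbits = targetMbits;
        this.minThreads = minThreads;
        this.maxThreads = maxThreads;
    }

    // Returns the thread count to use next, given the current count and
    // the throughput measured while running with that count.
    int nextThreadCount(int currentThreads, double measuredMbits) {
        int threads = Math.max(currentThreads, 1);
        // Floor the measurement so dead hosts do not cause a divide by zero.
        double perThreadMbits = Math.max(measuredMbits, 0.01) / threads;
        int wanted = (int) Math.round(targetMbits / perThreadMbits);
        return Math.max(minThreads, Math.min(maxThreads, wanted));
    }

    public static void main(String[] args) {
        ThreadScalerSketch scaler = new ThreadScalerSketch(50.0, 10, 500);
        // Mostly dead pages: 100 threads only manage 5 Mbit, so scale up.
        System.out.println(scaler.nextThreadCount(100, 5.0));
        // Large pages: 200 threads overshoot at 150 Mbit, so scale down.
        System.out.println(scaler.nextThreadCount(200, 150.0));
    }
}
```

With unresponsive pages each thread moves almost no data, so the count runs
up to the ceiling; with large downloads a few dozen threads are enough to
fill the pipe.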


-- 
Rod Taylor <[EMAIL PROTECTED]>



