On Fri, 2006-03-17 at 13:44 -0500, Insurance Squared Inc. wrote:
> We've got a site that is causing our crawl to slow dramatically, from
> 20mbits down to about 3 or 4. The basic problem is that the site seems
> to consist of huge numbers of pages that aren't responding. We can
> remove the site from the index, but it seems like a problem to remove
> this site permanently from the webdb so that we never fetch it again.
> Is there an easy way in 0.71 to remove a site from the webdb, and then
> keep it permanently removed?

You can add a filter on that domain to your regex-urlfilter.txt file, or
you can let Nutch churn through each URL and mark it as invalid
individually.

The second option can go quite quickly if Nutch scales the number of
threads to make the best use of bandwidth. Encourage the Nutch folks to
apply this patch:

http://issues.apache.org/jira/browse/NUTCH-207

I give it 50Mbits, and Nutch will scale up to 500 threads per task if
most threads are hitting bad pages, or down to about 60 threads per task
if they're downloading large pages. In the end we stay within about 10%
of the 50Mbit target.

-- 
Rod Taylor <[EMAIL PROTECTED]>

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
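
For reference, a domain filter of the kind mentioned above might look
roughly like the fragment below. This is only a sketch: example.com is a
placeholder for the problem site, and the surrounding rules in your own
regex-urlfilter.txt will differ. Each line is a + (accept) or - (reject)
followed by a regular expression, and the first matching rule decides, so
the reject line has to come before any general accept rule. Whether URLs
already in the webdb are skipped once the filter is in place is an
assumption worth checking on your version.

```
# reject everything on the problem host and its subdomains
# (example.com is a placeholder, not the real site)
-^http://([a-z0-9-]*\.)*example\.com/

# accept anything else
+.
```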
