Hi,

We've been using Nutch for focused crawling; right now we are crawling about 50 domains.
We've run into the long-tail problem. We've set topN to 100,000 and generate.max.per.host to about 1500. 90% of the domains finish fetching within 30 minutes, but the remaining 10% take an additional 2.5 hours, so the slowest domain becomes the bottleneck of the entire fetch process.

I've read Ken Krugler's write-up, which describes the same problem: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/

Does anyone have a suggestion for the best way to tackle this? I think Ken suggested limiting the fetch time, for example "terminate after 1 hour, even if you are not done yet". Is that feature available in Nutch?

I'll be happy to try to contribute code if required!

Thanks,
Eran
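
P.S. To make the "terminate after 1 hour" idea concrete, here is a rough sketch of the behaviour I have in mind. This is plain Java and not Nutch code: the class `TimeLimitedFetch`, the method `fetchWithDeadline`, the injected clock, and the example.com URLs are all made up for illustration, and the "fetch" is just a stand-in callback. The point is only that once the time budget is spent, the remaining URLs are skipped (to be picked up in a later round) instead of holding up the whole segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

// Hypothetical illustration only; none of these names exist in Nutch.
class TimeLimitedFetch {

    /** Outcome of one run: URLs that were fetched and URLs skipped because time ran out. */
    static final class Result {
        final List<String> fetched = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Fetches URLs in order until the time budget is used up.
     * The clock is passed in so the behaviour can be simulated without waiting.
     */
    static Result fetchWithDeadline(List<String> urls,
                                    Consumer<String> fetcher,
                                    long limitMillis,
                                    LongSupplier clock) {
        long deadline = clock.getAsLong() + limitMillis;
        Result result = new Result();
        for (String url : urls) {
            if (clock.getAsLong() >= deadline) {
                // Out of time: leave this URL for a later fetch round.
                result.skipped.add(url);
            } else {
                fetcher.accept(url);
                result.fetched.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated clock in milliseconds; each pretend fetch costs 20 minutes.
        long[] now = {0L};
        List<String> urls = Arrays.asList(
                "http://slow.example.com/1", "http://slow.example.com/2",
                "http://slow.example.com/3", "http://slow.example.com/4",
                "http://slow.example.com/5");

        Result r = fetchWithDeadline(
                urls,
                url -> { now[0] += 20L * 60 * 1000; },  // stand-in for a slow fetch
                60L * 60 * 1000,                        // 1 hour budget
                () -> now[0]);

        System.out.println("fetched=" + r.fetched.size() + " skipped=" + r.skipped.size());
    }
}
```

In this simulation three URLs fit inside the hour and the other two are skipped. In a real fetcher the check would presumably have to happen per fetch queue or per fetcher thread, which is the part I'm unsure how to fit into Nutch.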