Hi,

We've been using Nutch for focused crawling (right now we are crawling about
50 domains).

We've run into the long-tail problem. We've set TopN to 100,000 and
generate.max.per.host to about 1500.

About 90% of the domains finish fetching within 30 minutes, but the other
10% take an additional 2.5 hours, which makes the slowest domain the
bottleneck of the entire fetch process.

I've read Ken Krugler's write-up, which describes the same problem:
http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/

Does anyone have a suggestion for the best way to tackle this issue?

I think Ken suggested limiting the fetch time, for example "terminate
after 1 hour, even if you are not done yet". Is that feature available in
Nutch?
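
To make the idea concrete, here is a minimal sketch of what I mean by a
time limit. It is plain Java, not Nutch code: the class TimeLimitedFetch,
the method fetchWithDeadline and the example URLs are all hypothetical, and
a real fetcher would be multi-threaded and honour per-host politeness delays.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

// Hypothetical sketch, NOT Nutch code: drain a URL queue until a deadline passes.
public class TimeLimitedFetch {

    /**
     * Fetches URLs until the queue is empty or the time limit is reached.
     * The clock is injected so the limit can be tested without waiting.
     * Returns the queue holding whatever was left unfetched.
     */
    static Queue<String> fetchWithDeadline(Queue<String> urls,
                                           long limitMillis,
                                           LongSupplier clock,
                                           Consumer<String> fetchOne) {
        long deadline = clock.getAsLong() + limitMillis;
        // Check the deadline before each fetch; a fetch in progress is not interrupted.
        while (!urls.isEmpty() && clock.getAsLong() < deadline) {
            fetchOne.accept(urls.poll());
        }
        return urls; // remaining URLs are skipped in this round
    }

    public static void main(String[] args) {
        Queue<String> urls = new ArrayDeque<>(List.of(
            "http://example.com/a", "http://example.com/b", "http://example.com/c"));
        long oneHourMillis = 60L * 60L * 1000L;
        Queue<String> left = fetchWithDeadline(urls, oneHourMillis,
            System::currentTimeMillis, u -> System.out.println("fetching " + u));
        System.out.println("unfetched: " + left.size());
    }
}
```

The URLs left in the queue when the deadline hits would simply be picked up
by a later generate/fetch round instead of holding up the current one.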

I'd be happy to try to contribute code if required!

Thanks,
Eran
