Hi Eran,

There is currently no time limit implemented in the Fetcher. We implemented one which worked quite well in combination with another mechanism which clears the URLs from a pool once more than x successive exceptions have been encountered. This limits the impact of cases where a site or domain is not responsive.
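To make the idea concrete, here is a minimal standalone sketch of those two mechanisms - it is not the actual Nutch Fetcher code or our patch, and the class, field and method names (TimeBoundedFetcher, fetch(), the counters) are purely illustrative assumptions:

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.TimeUnit;

    /**
     * Illustrative sketch only (not Nutch's Fetcher): a hard wall-clock
     * limit on the whole fetch phase, plus dropping a host's queue after
     * too many consecutive failures.
     */
    public class TimeBoundedFetcher {

        private final long timeLimitMillis;          // e.g. 60 minutes
        private final int maxConsecutiveExceptions;  // the "x" above

        // one pending-URL queue per host, plus a failure counter per host
        private final Map<String, Queue<String>> queuesByHost = new ConcurrentHashMap<>();
        private final Map<String, Integer> failuresByHost = new ConcurrentHashMap<>();

        public TimeBoundedFetcher(long timeLimitMinutes, int maxConsecutiveExceptions) {
            this.timeLimitMillis = TimeUnit.MINUTES.toMillis(timeLimitMinutes);
            this.maxConsecutiveExceptions = maxConsecutiveExceptions;
        }

        public void addUrl(String host, String url) {
            queuesByHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(url);
        }

        public void run() {
            long deadline = System.currentTimeMillis() + timeLimitMillis;

            while (!queuesByHost.isEmpty()) {
                // 1. time limit: stop the whole fetch; remaining URLs would
                //    simply be retried in a later generate/fetch cycle
                if (System.currentTimeMillis() > deadline) {
                    System.err.println("Time limit reached, aborting fetch with "
                            + queuesByHost.size() + " host queues still pending");
                    return;
                }

                for (Map.Entry<String, Queue<String>> e : queuesByHost.entrySet()) {
                    String host = e.getKey();
                    String url = e.getValue().poll();
                    if (url == null) {
                        queuesByHost.remove(host);
                        continue;
                    }
                    try {
                        fetch(url);
                        failuresByHost.put(host, 0);   // reset on success
                    } catch (Exception ex) {
                        int failures = failuresByHost.merge(host, 1, Integer::sum);
                        // 2. unresponsive host: clear its whole queue
                        if (failures >= maxConsecutiveExceptions) {
                            System.err.println("Dropping queue for " + host
                                    + " after " + failures + " consecutive failures");
                            queuesByHost.remove(host);
                        }
                    }
                }
            }
        }

        // placeholder: a protocol plugin would do the real HTTP fetch here
        private void fetch(String url) throws Exception {
        }
    }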
I might try and submit a patch if I find the time next week. Our code has been heavily modified by the previous patches which have not been committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need to spend a bit of time extracting this specific functionality from the rest.

Best,
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/21 Eran Zinman <zze...@gmail.com>
> Hi,
>
> We've been using Nutch for focused crawling (right now we are crawling
> about 50 domains).
>
> We've encountered the long-tail problem - we've set TopN to 100,000 and
> generate.max.per.host to about 1500.
>
> 90% of all domains finish fetching within 30 minutes, and the other 10% take
> an additional 2.5 hours - making the slowest domain the bottleneck of the
> entire fetch process.
>
> I've read Ken Krugler's document, and he describes the same problem:
> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>
> I'm wondering - does anyone have a suggestion on the best way to tackle
> this issue?
>
> I think Ken suggested limiting the fetch time - for example, "terminate
> after 1 hour, even if you are not done yet". Is that feature available in
> Nutch?
>
> I will be happy to try and contribute code if required!
>
> Thanks,
> Eran