Hi guys, I've split the two features into separate patches on JIRA (NUTCH-769 / NUTCH-770).
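Both limits are meant to be configurable. As a rough sketch of what the nutch-site.xml entries look like (check the JIRA attachments for the exact property names and defaults, which may still change), something along these lines:

<!-- sketch only; names follow the patches but may differ in the final version -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>60</value>
  <description>Stop fetching after this many minutes, dropping whatever
  is left in the queues. -1 disables the limit.</description>
</property>

<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>10</value>
  <description>Clear a host/domain queue once this many successive
  exceptions have been encountered. -1 disables the limit.</description>
</property>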
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/21 Julien Nioche <lists.digitalpeb...@gmail.com>

> Hi Eran,
>
> There is currently no time limit implemented in the Fetcher. We implemented
> one which worked quite well in combination with another mechanism that
> clears the URLs from a pool once more than x successive exceptions have
> been encountered. This limits the damage when a site or domain is
> unresponsive.
>
> I might try to submit a patch if I find the time next week. Our code has
> been heavily modified by earlier patches that have not been committed to
> the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need to spend a
> bit of time extracting this specific functionality from the rest.
>
> Best,
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> 2009/11/21 Eran Zinman <zze...@gmail.com>
>
>> Hi,
>>
>> We've been using Nutch for focused crawling (right now we are crawling
>> about 50 domains).
>>
>> We've hit the long-tail problem. We've set TopN to 100,000 and
>> generate.max.per.host to about 1500.
>>
>> 90% of the domains finish fetching within 30 minutes, and the other 10%
>> take an additional 2.5 hours, making the slowest domain the bottleneck of
>> the entire fetch process.
>>
>> I've read Ken Krugler's post, and he describes the same problem:
>> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>>
>> I'm wondering: does anyone have a suggestion on the best way to tackle
>> this issue?
>>
>> I think Ken suggested limiting the fetch time, for example "terminate
>> after 1 hour, even if you are not done yet". Is that feature available in
>> Nutch?
>>
>> I'll be happy to try and contribute code if required!
>>
>> Thanks,
>> Eran
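PS: for anyone who wants the gist of the two mechanisms without digging through the patches, here is a minimal, illustrative sketch combining them: a per-host queue that clears itself after too many successive exceptions, and stops handing out URLs once a wall-clock limit is hit. Class and method names are made up for the example; this is not the actual patch code.

import java.util.LinkedList;
import java.util.Queue;

/**
 * Illustrative sketch only, not the NUTCH-769/770 code: a per-host fetch
 * queue that empties itself after too many successive exceptions, and that
 * enforces an overall time limit on fetching.
 */
public class TimeLimitedFetchQueue {

  private final Queue<String> urls = new LinkedList<String>();
  private final int maxExceptions;  // cf. an x-successive-exceptions limit
  private final long deadline;      // absolute time at which fetching stops
  private int successiveExceptions = 0;

  public TimeLimitedFetchQueue(int maxExceptions, long timeLimitMins) {
    this.maxExceptions = maxExceptions;
    this.deadline = System.currentTimeMillis() + timeLimitMins * 60L * 1000L;
  }

  public void addUrl(String url) {
    urls.add(url);
  }

  /** Returns the next URL to fetch, or null once the queue is cut off. */
  public String next() {
    if (System.currentTimeMillis() > deadline) {
      urls.clear();                 // time limit hit: drop the long tail
      return null;
    }
    return urls.poll();
  }

  /** Call after each fetch attempt for a URL taken from this queue. */
  public void reportResult(boolean exception) {
    if (!exception) {
      successiveExceptions = 0;     // a success resets the counter
    } else if (++successiveExceptions > maxExceptions) {
      urls.clear();                 // host looks dead: clear its pool
    }
  }
}

The point of counting *successive* exceptions (and resetting on success) is that a host which occasionally times out keeps its queue, while one that has stopped responding altogether gets dropped quickly instead of holding up the whole fetch.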