Thanks Julien, I can confirm this patch works perfectly and does a good job of maintaining a high crawl rate.
We have doubled the rate of information retrieval by using a time limit on the fetch queue.

Thanks,
Eran

On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi guys,
>
> I've separated both functionalities into separate patches on JIRA
> (NUTCH-769 / NUTCH-770).
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/11/21 Julien Nioche <lists.digitalpeb...@gmail.com>
>
> > Hi Eran,
> >
> > There is currently no time limit implemented in the Fetcher. We
> > implemented one which worked quite well in combination with another
> > mechanism which clears the URLs from a pool if more than x successive
> > exceptions have been encountered. This limits cases where a site or
> > domain is not responsive.
> >
> > I might try and submit a patch if I find the time next week, our code has
> > been heavily modified with the previous patches which have not been
> > committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658) so I'd
> > need to spend a bit of time extracting this specific functionality from
> > the rest.
> >
> > Best,
> >
> > Julien
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > 2009/11/21 Eran Zinman <zze...@gmail.com>
> >
> >> Hi,
> >>
> >> We've been using Nutch for focused crawling (right now we are crawling
> >> about 50 domains).
> >>
> >> We've encountered the long-tail problem - We've set TopN to 100,000 and
> >> generate.max.per.host to about 1500.
> >>
> >> 90% of all domains finish fetching after 30min, and the other 10% takes
> >> an additional 2.5 hours - making the slowest domain the bottleneck of
> >> the entire fetch process.
> >>
> >> I've read Ken Krugler document and he's describing the same problem:
> >>
> >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> >>
> >> I'm wondering - does anyone have a suggestion on what's the best way to
> >> tackle this issue?
> >>
> >> I think that Ken suggested to limit the fetch time - for example say
> >> "terminate after 1 hour, even if you are not done yet", is that feature
> >> available in Nutch?
> >>
> >> I will be happy to try and contribute code if required!
> >>
> >> Thanks,
> >> Eran
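
For anyone landing on this thread later: the two mechanisms Julien describes (a time limit on fetching, and clearing a host's pending URLs once more than x successive exceptions have been seen) can be sketched roughly as below. This is only an illustration under my own assumptions. The class names HostQueue and TimeLimit and all of their methods are hypothetical; this is not the code from NUTCH-769 / NUTCH-770, nor Nutch's actual Fetcher classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical per-host queue of pending URLs (illustration only, not Nutch code). */
class HostQueue {
    private final Deque<String> urls = new ArrayDeque<>();
    private final int maxSuccessiveExceptions; // the "x" from the thread
    private int successiveExceptions = 0;

    HostQueue(List<String> initialUrls, int maxSuccessiveExceptions) {
        this.urls.addAll(initialUrls);
        this.maxSuccessiveExceptions = maxSuccessiveExceptions;
    }

    /** Returns the next URL to fetch, or null when nothing is left. */
    String next() {
        return urls.poll();
    }

    /** A successful fetch ends the current failure streak. */
    void recordSuccess() {
        successiveExceptions = 0;
    }

    /**
     * Records a failed fetch. Once MORE than the allowed number of successive
     * failures has been seen, the remaining URLs are dropped.
     * Returns how many URLs were dropped (0 if the queue was left alone).
     */
    int recordException() {
        successiveExceptions++;
        return successiveExceptions > maxSuccessiveExceptions ? clear() : 0;
    }

    /** Drops every pending URL; also what a fetch loop would call on timeout. */
    int clear() {
        int dropped = urls.size();
        urls.clear();
        return dropped;
    }

    int size() {
        return urls.size();
    }
}

/** Hypothetical wall-clock budget for a whole fetch run. */
class TimeLimit {
    private final long deadlineMillis;

    TimeLimit(long startMillis, long limitMillis) {
        this.deadlineMillis = startMillis + limitMillis;
    }

    /** True once the given time has reached the deadline. */
    boolean expired(long nowMillis) {
        return nowMillis >= deadlineMillis;
    }
}
```

In a fetch loop, one would check expired(System.currentTimeMillis()) before taking the next URL and call clear() on every queue once it returns true, so a slow tail of hosts cannot hold up the whole run. Passing the current time in as a parameter is just a choice to keep the sketch easy to test.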