Hi Eran,

There is currently no time limit implemented in the Fetcher. We implemented one which worked quite well in combination with another mechanism which clears the URLs from a pool once more than x successive exceptions have been encountered. This limits the impact of cases where a site or domain is not responsive.
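To make the idea concrete, here is a minimal standalone sketch of those two mechanisms - it is not the actual Nutch Fetcher code or our patch, and the class, field and method names (TimeBoundedFetcher, fetch(), the counters) are purely illustrative assumptions:

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.TimeUnit;

    /**
     * Illustrative sketch only (not Nutch's Fetcher): a hard wall-clock
     * limit on the whole fetch phase, plus dropping a host's queue after
     * too many consecutive failures.
     */
    public class TimeBoundedFetcher {

        private final long timeLimitMillis;          // e.g. 60 minutes
        private final int maxConsecutiveExceptions;  // the "x" above

        // one pending-URL queue per host, plus a failure counter per host
        private final Map<String, Queue<String>> queuesByHost = new ConcurrentHashMap<>();
        private final Map<String, Integer> failuresByHost = new ConcurrentHashMap<>();

        public TimeBoundedFetcher(long timeLimitMinutes, int maxConsecutiveExceptions) {
            this.timeLimitMillis = TimeUnit.MINUTES.toMillis(timeLimitMinutes);
            this.maxConsecutiveExceptions = maxConsecutiveExceptions;
        }

        public void addUrl(String host, String url) {
            queuesByHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(url);
        }

        public void run() {
            long deadline = System.currentTimeMillis() + timeLimitMillis;

            while (!queuesByHost.isEmpty()) {
                // 1. time limit: stop the whole fetch; remaining URLs would
                //    simply be retried in a later generate/fetch cycle
                if (System.currentTimeMillis() > deadline) {
                    System.err.println("Time limit reached, aborting fetch with "
                            + queuesByHost.size() + " host queues still pending");
                    return;
                }

                for (Map.Entry<String, Queue<String>> e : queuesByHost.entrySet()) {
                    String host = e.getKey();
                    String url = e.getValue().poll();
                    if (url == null) {
                        queuesByHost.remove(host);
                        continue;
                    }
                    try {
                        fetch(url);
                        failuresByHost.put(host, 0);   // reset on success
                    } catch (Exception ex) {
                        int failures = failuresByHost.merge(host, 1, Integer::sum);
                        // 2. unresponsive host: clear its whole queue
                        if (failures >= maxConsecutiveExceptions) {
                            System.err.println("Dropping queue for " + host
                                    + " after " + failures + " consecutive failures");
                            queuesByHost.remove(host);
                        }
                    }
                }
            }
        }

        // placeholder: a protocol plugin would do the real HTTP fetch here
        private void fetch(String url) throws Exception {
        }
    }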
I might try and submit a patch if I find the time next week. Our code has been heavily modified by the previous patches which have not been committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need to spend a bit of time extracting this specific functionality from the rest.

Best,
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/21 Eran Zinman <zze...@gmail.com>
> Hi,
>
> We've been using Nutch for focused crawling (right now we are crawling
> about 50 domains).
>
> We've encountered the long-tail problem - we've set TopN to 100,000 and
> generate.max.per.host to about 1500.
>
> 90% of all domains finish fetching within 30 minutes, and the other 10% take
> an additional 2.5 hours - making the slowest domain the bottleneck of the
> entire fetch process.
>
> I've read Ken Krugler's document, and he describes the same problem:
> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>
> I'm wondering - does anyone have a suggestion on the best way to tackle
> this issue?
>
> I think Ken suggested limiting the fetch time - for example, "terminate
> after 1 hour, even if you are not done yet". Is that feature available in
> Nutch?
>
> I will be happy to try and contribute code if required!
>
> Thanks,
> Eran