Hi guys, I've split the two features into separate patches on JIRA (NUTCH-769 / NUTCH-770).
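Both limits are meant to be configurable. As a rough sketch of what the nutch-site.xml entries look like (check the JIRA attachments for the exact property names and defaults, which may still change), something along these lines:

<!-- sketch only; names follow the patches but may differ in the final version -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>60</value>
  <description>Stop fetching after this many minutes, dropping whatever
  is left in the queues. -1 disables the limit.</description>
</property>

<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>10</value>
  <description>Clear a host/domain queue once this many successive
  exceptions have been encountered. -1 disables the limit.</description>
</property>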
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/21 Julien Nioche <lists.digitalpeb...@gmail.com>

> Hi Eran,
>
> There is currently no time limit implemented in the Fetcher. We implemented
> one which worked quite well in combination with another mechanism that
> clears the URLs from a pool once more than x successive exceptions have
> been encountered. This limits the damage when a site or domain is
> unresponsive.
>
> I might try to submit a patch if I find the time next week. Our code has
> been heavily modified by earlier patches that have not been committed to
> the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need to spend a
> bit of time extracting this specific functionality from the rest.
>
> Best,
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> 2009/11/21 Eran Zinman <zze...@gmail.com>
>
>> Hi,
>>
>> We've been using Nutch for focused crawling (right now we are crawling
>> about 50 domains).
>>
>> We've hit the long-tail problem. We've set TopN to 100,000 and
>> generate.max.per.host to about 1500.
>>
>> 90% of the domains finish fetching within 30 minutes, and the other 10%
>> take an additional 2.5 hours, making the slowest domain the bottleneck of
>> the entire fetch process.
>>
>> I've read Ken Krugler's post, and he describes the same problem:
>> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
>>
>> I'm wondering: does anyone have a suggestion on the best way to tackle
>> this issue?
>>
>> I think Ken suggested limiting the fetch time, for example "terminate
>> after 1 hour, even if you are not done yet". Is that feature available in
>> Nutch?
>>
>> I'll be happy to try and contribute code if required!
>>
>> Thanks,
>> Eran
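PS: for anyone who wants the gist of the two mechanisms without digging through the patches, here is a minimal, illustrative sketch combining them: a per-host queue that clears itself after too many successive exceptions, and stops handing out URLs once a wall-clock limit is hit. Class and method names are made up for the example; this is not the actual patch code.

import java.util.LinkedList;
import java.util.Queue;

/**
 * Illustrative sketch only, not the NUTCH-769/770 code: a per-host fetch
 * queue that empties itself after too many successive exceptions, and that
 * enforces an overall time limit on fetching.
 */
public class TimeLimitedFetchQueue {

  private final Queue<String> urls = new LinkedList<String>();
  private final int maxExceptions;  // cf. an x-successive-exceptions limit
  private final long deadline;      // absolute time at which fetching stops
  private int successiveExceptions = 0;

  public TimeLimitedFetchQueue(int maxExceptions, long timeLimitMins) {
    this.maxExceptions = maxExceptions;
    this.deadline = System.currentTimeMillis() + timeLimitMins * 60L * 1000L;
  }

  public void addUrl(String url) {
    urls.add(url);
  }

  /** Returns the next URL to fetch, or null once the queue is cut off. */
  public String next() {
    if (System.currentTimeMillis() > deadline) {
      urls.clear();                 // time limit hit: drop the long tail
      return null;
    }
    return urls.poll();
  }

  /** Call after each fetch attempt for a URL taken from this queue. */
  public void reportResult(boolean exception) {
    if (!exception) {
      successiveExceptions = 0;     // a success resets the counter
    } else if (++successiveExceptions > maxExceptions) {
      urls.clear();                 // host looks dead: clear its pool
    }
  }
}

The point of counting *successive* exceptions (and resetting on success) is that a host which occasionally times out keeps its queue, while one that has stopped responding altogether gets dropped quickly instead of holding up the whole fetch.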