Hi,
Thanks for the reply.
I've tried Fetcher2, but it seems to be even slower -- most of the threads
end up spinning idle after just one or two levels. I see that some others
find Fetcher2 quite slow too. Is there an alternative?
Thanks.
Michael
On Feb 7, 2009 9:53am, Andrzej Bialecki <[email protected]> wrote:
[email protected] wrote:
Hi,
I'm trying to crawl about 2 million pages from the web using the intranet
method, i.e. bin/nutch crawl dmoz/urls -dir crawl -threads 20 -depth 5 -topN
2000000. When it reaches about the 3rd level, most of the threads are
blocked by blockAddr(). I see that blockAddr() blocks threads when the
number of threads accessing a host exceeds fetcher.threads.per.host,
which is set to 2 at the moment. By the 3rd level, about 100k pages would
have been crawled. My question is: why do the threads access just a few
hosts when there are (potentially) many other hosts for them to try? Is
there a way to work around this?
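For reference, fetcher.threads.per.host is a standard Nutch property that can be overridden in nutch-site.xml. A minimal sketch of such an override (the value 4 is purely illustrative; raising it reduces blocking in blockAddr() at the cost of putting more concurrent load on each host):

```xml
<configuration>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- illustrative value; the setup described above uses 2 -->
    <value>4</value>
  </property>
</configuration>
```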
Use Fetcher2 instead of Fetcher - it uses a different queue and blocking
algorithm, which works better in such cases.
--
Best regards,
Andrzej Bialecki <>
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com