Hi,

----- Original Message ----
From: Ken Krugler <[EMAIL PROTECTED]>

>On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>Hello,
>>
>>Several people reported issues with slow fetcher in 0.8...
>>
>>I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>fetch speed didn't increase when I went from 100 threads to 
>>200 threads.  Has anyone else observed the same?
>>
>>I was using 2 map tasks (mapred.map.tasks property) in both cases, 
>>and the aggregate fetch speed was between 20 and 40 pages/sec. 
>>This was a fetch of 50K+ URLs from a diverse set of servers.

<snip>

>>I saw Ken Krugler's email suggesting to increase the number of 
>>fetcher threads to 2000+ and set the maximum Java thread stack size 
>>to 512k with -Xss.  Has anyone other than Ken tried this with 
>>success?  Wouldn't the JVM go crazy context switching between this 
>>many threads?
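
For a rough sense of why the stack size comes up at these thread counts, here is a back-of-the-envelope sketch. It is only arithmetic, not anything from Nutch: the class and method names (StackBudget, reservedBytes) are made up for illustration, and it assumes every thread gets the full -Xss value as its stack reservation. On most platforms that is reserved address space rather than memory actually in use.

```java
public class StackBudget {
    /** Address space reserved for thread stacks, in bytes (hypothetical helper). */
    static long reservedBytes(int threads, long stackBytesPerThread) {
        return threads * stackBytesPerThread;
    }

    public static void main(String[] args) {
        long perThread = 512L * 1024;   // corresponds to -Xss512k
        // Thread counts mentioned in this thread: 100, 200, and 2000.
        for (int threads : new int[] {100, 200, 2000}) {
            long mib = reservedBytes(threads, perThread) / (1024 * 1024);
            System.out.println(threads + " threads x 512k = " + mib
                    + " MiB of stack address space");
        }
    }
}
```

At 2000 threads this works out to 1000 MiB for stacks alone, which is presumably why a smaller -Xss is suggested together with the higher thread count.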

Note that most of the time these fetcher threads are all blocked, 
waiting for other threads that are already fetching from the same IP 
address. So there's not a lot of thrashing.

OG: I see.  But wouldn't that be true only in the case of more vertical crawls, 
i.e. crawls that don't have a very large and diverse set of hosts?
In other words, if you are doing a web-wide crawl, each of those 2000 fetcher 
threads is very likely to be assigned to a host/IP that is currently not being 
crawled by any other fetcher thread, no?
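
To make the question concrete, here is a toy sketch of the blocking behaviour being discussed. It is not Nutch's fetcher code: the class PerHostBlockingSketch, the one-permit-per-host rule, and the 5 ms sleep standing in for a network fetch are all assumptions for illustration. Threads waiting on a busy host sleep inside acquire() and use no CPU, so the number of threads doing work at any moment is bounded by the number of distinct hosts, not by the thread count.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class PerHostBlockingSketch {
    // One permit per host: only one thread may "fetch" from a host at a time.
    private static final Map<String, Semaphore> HOST_LOCKS = new ConcurrentHashMap<>();
    private static final AtomicInteger active = new AtomicInteger();
    // Highest number of threads seen inside the fetch section at once.
    private static final AtomicInteger maxActive = new AtomicInteger();

    static void fetch(String host) throws InterruptedException {
        Semaphore lock = HOST_LOCKS.computeIfAbsent(host, h -> new Semaphore(1));
        lock.acquire();                  // threads for a busy host block here
        try {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max);
            Thread.sleep(5);             // stand-in for the network round trip
        } finally {
            active.decrementAndGet();
            lock.release();
        }
    }

    /** Starts `threads` fetcher threads spread over `hosts` hosts; returns peak concurrency. */
    static int run(int threads, int hosts) throws InterruptedException {
        HOST_LOCKS.clear();
        active.set(0);
        maxActive.set(0);
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            String host = "host" + (i % hosts) + ".example";
            new Thread(() -> {
                try {
                    fetch(host);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        done.await();
        return maxActive.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("200 threads,   4 hosts, peak active: " + run(200, 4));
        System.out.println("200 threads, 200 hosts, peak active: " + run(200, 200));
    }
}
```

With 4 hosts the peak can never exceed 4, however many threads are started. With as many hosts as threads, nothing holds the threads back, which is the web-wide crawl case asked about above.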

>I've been working with 512k -Xss (Ken Krugler's suggestion) and it works
>well. However, the number of fetcher threads in my case is 2500+. I had to
>play around with this number to match my bandwidth limitation, but now I
>use my full bandwidth. The problems that I run into are fetcher threads
>hanging, and the handling of crawl delay/robots.txt (please see
>Dennis Kubes' posting on this).

Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems 
very promising, but not yet ready for prime time.

OG: yeah, I saw his email.  Kelvin worked on the same thing many months ago, 
pre-0.8, but it never made it into the trunk.  I'm looking forward to Stefan's 
code now.

Otis



