Re: On fetcher slowness

Zaheed Haque Sat, 12 Aug 2006 00:05:28 -0700

On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Hello,


Several people have reported issues with a slow fetcher in 0.8...

I run Nutch on a dual CPU (+HT) box, and have noticed that the fetch speed
didn't increase when I went from 100 threads to 200 threads.  Has anyone
else observed the same?

I was using 2 map tasks (mapred.map.tasks property) in both cases, and the 
aggregate fetch speed was between 20 and 40 pages/sec.  This was a fetch of 
50K+ URLs from a diverse set of servers.

While crawling, strace -p<PID> and strace -ff<PID> show a LOT of gettimeofday
calls.  Running strace several times in a row kept showing gettimeofday as the
most frequent system call.
Has anyone tried tracing the fetcher process?  Where do these calls come from?
Perhaps from calls to new Date() or Calendar.getInstance(), which presumably
happen on every single logging call?

I can certainly be impolite and lower fetcher.server.delay to 1 second or even 
0, but I'd like to be polite.
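
A rough way to see why adding threads may not raise the fetch rate: if each
host is contacted at most once per fetcher.server.delay seconds, the number of
distinct hosts being fetched caps the aggregate rate, regardless of the thread
count. The sketch below is only an illustration under that assumption. The
class and method names are hypothetical (not Nutch code), and the 5 second
delay is an assumed setting.

```java
// Hypothetical helper, not part of Nutch: estimates the politeness ceiling
// on the aggregate fetch rate.
final class PolitenessBound {

    /** Upper bound on pages/sec when each host is hit at most once per delay. */
    static double maxPagesPerSecond(int distinctHosts, double serverDelaySeconds) {
        if (serverDelaySeconds <= 0) {
            return Double.POSITIVE_INFINITY; // no politeness limit applies
        }
        return distinctHosts / serverDelaySeconds;
    }

    /** Minimum number of distinct hosts that must be in flight to reach a target rate. */
    static int hostsNeeded(double targetPagesPerSecond, double serverDelaySeconds) {
        return (int) Math.ceil(targetPagesPerSecond * serverDelaySeconds);
    }

    public static void main(String[] args) {
        double delay = 5.0; // assumed value of fetcher.server.delay, in seconds
        System.out.println(maxPagesPerSecond(100, delay));
        System.out.println(hostsNeeded(40.0, delay));
    }
}
```

Under these assumptions, threads beyond the number of hosts available at any
moment would mostly sit waiting out the delay, which would be consistent with
seeing no gain when going from 100 to 200 threads.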

I saw Ken Krugler's email suggesting to increase the number of fetcher threads
to 2000+ and set the maximum Java thread stack size to 512k with -Xss.  Has
anyone other than Ken tried this with success?  Wouldn't the JVM go crazy
context switching between this many threads?


I have been working with -Xss set to 512k (Ken Krugler's suggestion) and it
works well. However, the number of fetcher threads in my case is 2500+. I had
to play around with this number to match my bandwidth limit, but now I use my
full bandwidth. The problems I still run into are fetcher threads hanging and
the crawl delay/robots.txt handling (please see Dennis Kubes' posting on
this).
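
To make the stack size point concrete, here is a minimal sketch of the
arithmetic: every Java thread reserves its own stack, so the thread count
times the -Xss value gives the address space set aside for stacks alone. The
class and method names are hypothetical and not part of Nutch. The sketch also
shows the standard Thread constructor that accepts a per-thread stack size,
which the JVM treats as a hint and may ignore.

```java
// Hypothetical illustration, not Nutch code.
final class StackBudget {

    /** Address space reserved for thread stacks, in megabytes. */
    static long stackReservationMb(int threads, long stackKb) {
        return threads * stackKb / 1024;
    }

    /** Creates (does not start) a thread with an explicit stack-size hint in bytes. */
    static Thread newFetcherThread(Runnable task, String name, long stackBytes) {
        // The four-argument constructor is standard; a null group means
        // "use the current thread's group".
        return new Thread(null, task, name, stackBytes);
    }

    public static void main(String[] args) throws InterruptedException {
        // 2500 threads with 512k stacks each
        System.out.println(stackReservationMb(2500, 512) + " MB");

        Thread t = newFetcherThread(() -> {}, "fetcher-0", 512 * 1024);
        t.start();
        t.join();
    }
}
```

With 2500 threads at 512k each this comes to 1250 MB reserved for stacks, so
a larger per-thread stack would multiply accordingly. This is reserved address
space rather than memory actually touched, but it still counts against the
process limit, which matters most on a 32-bit JVM.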

I have been testing Fetcher2 (NUTCH-339) with about 2M+ URLs and I got
good results. I had some trouble in the beginning, but now it works well.
Note that this solves the crawl delay problem, but I still need to apply
the stack size changes.

Cheers
Zaheed
