On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello,
Several people have reported issues with a slow fetcher in 0.8...
I run Nutch on a dual-CPU (+HT) box, and have noticed that the fetch speed
didn't increase when I went from 100 threads to 200 threads. Has anyone
else observed the same?
I was using 2 map tasks (mapred.map.tasks property) in both cases, and the
aggregate fetch speed was between 20 and 40 pages/sec. This was a fetch of
50K+ URLs from a diverse set of servers.
While crawling, strace -p<PID> and strace -ff<PID> both show a LOT of gettimeofday
calls. Running strace several times in a row kept showing gettimeofday as the most
frequent system call.
Has anyone tried tracing the fetcher process? Where do these calls come from?
Perhaps from calls to new Date() or Calendar.getInstance(), which presumably
happen on every single logging call?
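
One way to check the guess about new Date() and Calendar.getInstance() is a small standalone probe, run under strace, that does nothing but create those objects in a loop. This is only a sketch: the class name ClockCallProbe and its methods are made up for illustration and are not part of Nutch. It relies on the fact that both constructors read the current wall-clock time.

```java
import java.util.Calendar;
import java.util.Date;

// Hypothetical probe class, not part of Nutch.
class ClockCallProbe {

    /** Creates n Date objects (each reads the wall clock once); returns elapsed nanoseconds. */
    static long timeNewDate(int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += new Date().getTime();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink); // use the result so the loop is not optimized away
        return elapsed;
    }

    /** Same measurement for Calendar.getInstance(), which also reads the current time. */
    static long timeCalendar(int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += Calendar.getInstance().getTimeInMillis();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink);
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("new Date():             " + timeNewDate(n) / n + " ns/call");
        System.out.println("Calendar.getInstance(): " + timeCalendar(n) / n + " ns/call");
    }
}
```

Attaching strace with -c (summary counts) and -f (follow threads) to this process should show whether gettimeofday dominates in the same way as in the fetcher. On some kernel/JVM combinations the clock is read without a real system call, in which case strace will show nothing for it.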
I can certainly be impolite and lower fetcher.server.delay to 1 second or even
0, but I'd like to be polite.
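
A rough back-of-the-envelope bound shows why the politeness delay matters more than the thread count: if each host is hit at most once every fetcher.server.delay seconds, the aggregate rate cannot exceed (hosts being fetched at once) / delay, however many threads are running. The sketch below just does that arithmetic. The class name, the 150-host figure, and the 5-second delay are assumptions for illustration, not measurements from this crawl.

```java
// Hypothetical helper, not part of Nutch.
class PolitenessBound {

    /**
     * Upper bound on aggregate pages/sec when every host is fetched
     * at most once per delaySeconds. Ignores download time entirely.
     */
    static double maxPagesPerSecond(int activeHosts, double delaySeconds) {
        return activeHosts / delaySeconds;
    }

    public static void main(String[] args) {
        // Assumed values: 150 hosts in flight, 5 s vs 1 s per-host delay.
        System.out.println(maxPagesPerSecond(150, 5.0)); // 30.0
        System.out.println(maxPagesPerSecond(150, 1.0)); // 150.0
    }
}
```

Under these assumed numbers the bound lands in the same 20 to 40 pages/sec range, which would be consistent with extra threads simply waiting on the per-host delay.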
I saw Ken Krugler's email suggesting to increase the number of fetcher threads
to 2000+ and to set the maximum Java thread stack size to 512k with -Xss. Has
anyone other than Ken tried this with success? Wouldn't the JVM go crazy
context switching between this many threads?
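
To put the -Xss suggestion in numbers: each thread reserves its own stack, so the address space set aside for stacks grows linearly with the thread count. A minimal sketch of that arithmetic follows. The class name StackBudget is made up, and the figures are simply the 2000 and 2500 threads and the 512k stack size mentioned in this thread. It also shows the standard Thread constructor that takes a per-thread stack size hint, as an alternative to changing the JVM-wide default.

```java
// Hypothetical helper, not part of Nutch.
class StackBudget {

    /** Address space reserved for thread stacks, in bytes. */
    static long reservedBytes(int threads, long stackBytesPerThread) {
        return threads * stackBytesPerThread;
    }

    public static void main(String[] args) throws InterruptedException {
        long stack = 512L * 1024; // 512k, the same value as -Xss512k

        System.out.println(reservedBytes(2000, stack) / (1024 * 1024) + " MiB"); // 1000 MiB
        System.out.println(reservedBytes(2500, stack) / (1024 * 1024) + " MiB"); // 1250 MiB

        // -Xss sets the default stack size for new threads; this constructor
        // requests a size for one thread only (the JVM may treat it as a hint).
        Thread t = new Thread(null,
                () -> System.out.println("running with a requested 512k stack"),
                "fetcher-sketch", stack);
        t.start();
        t.join();
    }
}
```

The stack is reserved address space rather than memory that is touched immediately, but on a 32-bit JVM the reservation alone can run into the process address-space limit, which is presumably why a smaller stack is needed at these thread counts.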
I have been working with -Xss set to 512k (Ken Krugler's suggestion) and it
works well. However, my number of fetcher threads is 2500+. I had to play
around with this number to match my bandwidth limit, but now I
use my full bandwidth. The problems I run into are fetcher threads
hanging, and the crawl delay/robots.txt handling (please see
Dennis Kubes' posting on this).
I have been testing Fetcher2 (Nutch-339) with about 2M+ URLs and got
good results. I had some trouble in the beginning, but it now works well.
Note that this solves the crawl delay problem, but I still need to apply
the stack size changes.
Cheers
Zaheed