On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello,
Several people reported issues with slow fetcher in 0.8...
I run Nutch on a dual CPU (+HT) box, and have noticed that the
fetch speed didn't increase when I went from using 100 threads to
200 threads. Has anyone else observed the same?
I was using 2 map tasks (mapred.map.tasks property) in both cases,
and the aggregate fetch speed was between 20 and 40 pages/sec.
This was a fetch of 50K+ URLs from a diverse set of servers.
While crawling, strace -p<PID> and strace -ff<PID> show a LOT of
gettimeofday calls. Running strace several times in a row kept
showing that gettimeofday is the most frequent system call.
Has anyone tried tracing the fetcher process? Where do these calls
come from? Perhaps from calls to new Date() or Calendar.getInstance(),
which presumably happen on every single logging call?
I can certainly be impolite and lower fetcher.server.delay to 1
second or even 0, but I'd like to be polite.
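
To see why the politeness delay can cap throughput regardless of thread
count, here is a rough back-of-the-envelope sketch in Java. It assumes
at most one request per server every fetcher.server.delay seconds and
ignores download time, so it is only an upper bound. The class and
method names are made up for illustration and are not part of Nutch;
the 5-second delay and the host counts are example values.

```java
/** Hypothetical helper, not part of Nutch: upper bound on polite fetch rate. */
final class PolitenessBound {

    /**
     * With at most one request per server every serverDelaySeconds,
     * the aggregate rate cannot exceed distinctHosts / serverDelaySeconds.
     * Download time is ignored, so real rates will be lower.
     */
    static double maxPagesPerSecond(int distinctHosts, double serverDelaySeconds) {
        return distinctHosts / serverDelaySeconds;
    }

    public static void main(String[] args) {
        // Example values only: 100 and 200 distinct servers, 5 second delay.
        System.out.println(maxPagesPerSecond(100, 5.0) + " pages/sec");
        System.out.println(maxPagesPerSecond(200, 5.0) + " pages/sec");
    }
}
```

Under these assumptions, a 5-second delay needs about 200 distinct
servers in flight at once to reach 40 pages/sec, and threads beyond the
number of distinct servers would mostly wait.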
I saw Ken Krugler's email suggesting to increase the number of
fetcher threads to 2000+ and set the maximum Java thread stack size
to 512k with -Xss. Has anyone other than Ken tried this with
success? Wouldn't the JVM go crazy context switching between this
many threads?
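
For a sense of scale on the -Xss setting, a small Java sketch of the
stack arithmetic follows. It assumes each thread reserves exactly the
-Xss amount and ignores the heap and any other per-thread overhead; the
class and method names are hypothetical.

```java
/** Hypothetical helper: address space reserved for thread stacks. */
final class StackBudget {

    /** Returns threads * stackKiB, converted from KiB to MiB. */
    static long stackReserveMiB(int threads, int stackKiB) {
        return (long) threads * stackKiB / 1024;
    }

    public static void main(String[] args) {
        // 2000 threads with -Xss512k
        System.out.println(stackReserveMiB(2000, 512) + " MiB");
        // 2500 threads with -Xss512k
        System.out.println(stackReserveMiB(2500, 512) + " MiB");
    }
}
```

This is reserved address space rather than memory that is necessarily
touched, so the actual resident footprint is typically smaller.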
Note that most of the time these fetcher threads are all blocked,
waiting for other threads that are already fetching from the same IP
address. So there's not a lot of thrashing.
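
A minimal Java sketch of the kind of per-server blocking described
above. This is not Nutch's actual fetcher code; the HostGate class, its
methods, and the example keys are invented for illustration, and it
assumes a fixed delay between the starts of successive requests to the
same host or IP.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-server politeness gate; not taken from Nutch. */
final class HostGate {
    private final long delayMillis;
    // key (host name or IP) -> earliest time in ms the next fetch may start
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostGate(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Reserves the next fetch slot for the given key and returns how many
     * milliseconds the caller has to wait before using it.
     */
    synchronized long reserve(String key, long nowMillis) {
        long start = Math.max(nowMillis, nextAllowed.getOrDefault(key, 0L));
        nextAllowed.put(key, start + delayMillis);
        return start - nowMillis;
    }

    /** Blocks the calling thread by sleeping (not spinning) until its slot. */
    void acquire(String key) throws InterruptedException {
        long wait = reserve(key, System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

A fetcher thread would call acquire(key) before each request. Threads
queued behind the same key spend their time asleep, so they cost stack
space but very little CPU.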
I have been working with 512k -Xss (Ken Krugler's suggestion) and it
works well. However, the number of fetcher threads in my case is 2500+.
I had to play around with this number to match my bandwidth limit, but
now I use my full bandwidth. The problems that I run into are hanging
fetcher threads and problems with crawl delay and the robots.txt file
(please see Dennis Kubes' posting on this).
Yes, these are definitely problems.
Stefan has been working on a queue-based fetcher that uses NIO. Seems
very promising, but not yet ready for prime time.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"