Re: On fetcher slowness

Ken Krugler Sat, 12 Aug 2006 18:03:50 -0700

On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello,
Several people reported issues with slow fetcher in 0.8...
I run Nutch on a dual CPU (+HT) box, and have noticed that the fetch speed didn't increase when I went from using 100 threads to 200 threads. Has anyone else observed the same?
I was using 2 map tasks (mapred.map.tasks property) in both cases, and the aggregate fetch speed was between 20 and 40 pages/sec. This was a fetch of 50K+ URLs from a diverse set of servers.
While crawling, strace -p<PID> and strace -ff<PID> show a LOT of gettimeofday calls. Running strace several times in a row kept showing that gettimeofday is the most frequent system call. Has anyone tried tracing the fetcher process? Where do these calls come from? Any call to new Date() or Calendar.getInstance(), as must be done for every single logging call, perhaps?
I can certainly be impolite and lower fetcher.server.delay to 1 second or even 0, but I'd like to be polite.
I saw Ken Krugler's email suggesting to increase the number of fetcher threads to 2000+ and set the maximal Java thread stack size to 512k with -Xss. Has anyone other than Ken tried this with success? Wouldn't the JVM go crazy context switching between this many threads?

Note that most of the time these fetcher threads are all blocked, waiting for other threads that are already fetching from the same IP address. So there's not a lot of thrashing.
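
To illustrate why mostly-blocked threads cost little, here is a minimal sketch of per-host serialization with a politeness delay. It is not Nutch's actual fetcher code: the class name HostGate, its fetch method, and the delay handling are all hypothetical, and it keys on a host string rather than a resolved IP address for simplicity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration only; not taken from the Nutch code base. */
public class HostGate {
    /** Per-host state: one lock plus the earliest time the next request may start. */
    private static final class HostState {
        final ReentrantLock lock = new ReentrantLock();
        long nextAllowedNanos = System.nanoTime(); // only touched while holding lock
    }

    private final ConcurrentHashMap<String, HostState> hosts = new ConcurrentHashMap<>();
    private final long delayNanos;

    /** delayMillis plays the role of a politeness delay between requests to one host. */
    public HostGate(long delayMillis) {
        this.delayNanos = delayMillis * 1_000_000L;
    }

    /** Runs the request once no other thread is using this host and the delay has passed. */
    public void fetch(String host, Runnable request) throws InterruptedException {
        HostState state = hosts.computeIfAbsent(host, h -> new HostState());
        state.lock.lockInterruptibly(); // threads for the same host park here, using no CPU
        try {
            long waitNanos = state.nextAllowedNanos - System.nanoTime();
            if (waitNanos > 0) {
                Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
            }
            request.run();
            state.nextAllowedNanos = System.nanoTime() + delayNanos;
        } finally {
            state.lock.unlock();
        }
    }
}
```

In a sketch like this, a thread waiting on the lock or sleeping is descheduled, so only threads that currently hold a host and are doing I/O compete for the CPU, however many threads exist in total.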

I've been working with 512k -Xss (Ken Krugler's suggestion) and it works
well. However, the number of fetcher threads on my side is 2500+. I had
to play around with this number to match my bandwidth limitation, but
now I use my full bandwidth. The problems that I run into are fetcher
threads hanging, and issues with the crawl delay/robots.txt file (please
see Dennis Kubes' posting on this).


Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems very promising, but not yet ready for prime time.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
