>On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>Hello,
>>
>>Several people reported issues with slow fetcher in 0.8...
>>
>>I run Nutch on a dual CPU (+HT) box, and have noticed that the 
>>fetch speed didn't increase when I went from using 100 threads, to 
>>200 threads.  Has anyone else observed the same?
>>
>>I was using 2 map tasks (mapred.map.tasks property) in both cases, 
>>and the aggregate fetch speed was between 20 and 40 pages/sec. 
>>This was a fetch of 50K+ URLs from a diverse set of servers.
>>
>>While crawling, strace -p<PID> and strace -ff<PID> show a LOT of 
>>gettimeofday calls.  Running strace several times in a row kept 
>>showing that gettimeofday is the most frequent system call.
>>Has anyone tried tracing the fetcher process?  Where do these calls 
>>come from?  Perhaps from calls to new Date() or 
>>Calendar.getInstance(), which must happen for every single logging call?
>>
>>I can certainly be impolite and lower fetcher.server.delay to 1 
>>second or even 0, but I'd like to be polite.
>>
>>I saw Ken Krugler's email suggesting to increase the number of 
>>fetcher threads to 2000+ and set the maximal java thread stack size 
>>to 512k with -Xss.  Has anyone other than Ken tried this with 
>>success?  Wouldn't the JVM go crazy context switching between this 
>>many threads?

Note that most of the time these fetcher threads are all blocked, 
waiting for other threads that are already fetching from the same IP 
address. So there's not a lot of thrashing.
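
To make the blocking behaviour concrete, here is a minimal sketch of a 
per-IP politeness gate. It is not Nutch's actual fetcher code: the class 
HostGate and its methods are hypothetical names, and it simplifies by 
spacing fetch start times rather than waiting for each fetch to finish. 
The delay plays a role similar to fetcher.server.delay.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-IP politeness gate; an illustration, not Nutch code. */
public class HostGate {
    private final long delayMs;  // minimum spacing between fetches to one IP
    private final Map<String, Long> nextFree = new HashMap<>();

    public HostGate(long delayMs) {
        this.delayMs = delayMs;
    }

    /**
     * Reserves the next fetch slot for the given IP and returns the time
     * (in ms) at which the caller may start fetching.
     */
    public synchronized long reserveSlot(String ip, long nowMs) {
        long start = Math.max(nowMs, nextFree.getOrDefault(ip, 0L));
        nextFree.put(ip, start + delayMs);
        return start;
    }

    /** Blocks the calling thread until its reserved slot arrives. */
    public void awaitTurn(String ip) throws InterruptedException {
        long start = reserveSlot(ip, System.currentTimeMillis());
        long wait = start - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);  // a sleeping thread is not scheduled
        }
    }
}
```

With a gate like this, a thread that asks for an IP another thread has 
just claimed simply sleeps, so thousands of threads can exist while only 
a few are runnable at any moment.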

>I have been working with -Xss set to 512k (Ken Krugler's suggestion) and
>it works well. However, the number of fetcher threads in my case is
>2500+. I had to play around with this number to match my bandwidth
>limit, but now I use my full bandwidth. The problems I run into are
>fetcher threads hanging, and crawl delay/robots.txt handling (please see
>Dennis Kubes's posting on this).

Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems 
very promising, but not yet ready for prime time.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
