>On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>Hello,
>>
>>Several people reported issues with slow fetcher in 0.8...
>>
>>I run Nutch on a dual CPU (+HT) box, and have noticed that the
>>fetch speed didn't increase when I went from using 100 threads to
>>200 threads. Has anyone else observed the same?
>>
>>I was using 2 map tasks (mapred.map.tasks property) in both cases,
>>and the aggregate fetch speed was between 20 and 40 pages/sec.
>>This was a fetch of 50K+ URLs from a diverse set of servers.
>>
>>While crawling, strace -p<PID> and strace -ff<PID> show a LOT of
>>gettimeofday calls. Running strace several times in a row kept
>>showing that gettimeofday is the most frequent system call.
>>Has anyone tried tracing the fetcher process? Where do these calls
>>come from? Any call to new Date() or Calendar.getInstance(), as
>>must be done for every single logging call, perhaps?
>>
>>I can certainly be impolite and lower fetcher.server.delay to 1
>>second or even 0, but I'd like to be polite.
>>
>>I saw Ken Krugler's email suggesting to increase the number of
>>fetcher threads to 2000+ and set the maximum Java thread stack size
>>to 512k with -Xss. Has anyone other than Ken tried this with
>>success? Wouldn't the JVM go crazy context switching between this
>>many threads?
Note that most of the time these fetcher threads are all blocked,
waiting for other threads that are already fetching from the same IP
address. So there's not a lot of thrashing.

>I've been working with 512k -Xss (Ken Krugler's suggestion) and it works
>well. However, the number of fetcher threads on my side is 2500+. I had
>to play around with this number to match my bandwidth limit, but now I
>use my full bandwidth. The problems I run into are hanging fetcher
>threads, and issues with the crawl delay/robots.txt file (please see
>Dennis Kubes' posting on this).

Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems
very promising, but not yet ready for prime time.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
