Hi,

----- Original Message ----
From: Ken Krugler <[EMAIL PROTECTED]>
>On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>Hello,
>>
>>Several people reported issues with slow fetcher in 0.8...
>>
>>I run Nutch on a dual CPU (+HT) box, and have noticed that the
>>fetch speed didn't increase when I went from using 100 threads to
>>200 threads. Has anyone else observed the same?
>>
>>I was using 2 map tasks (mapred.map.tasks property) in both cases,
>>and the aggregate fetch speed was between 20 and 40 pages/sec.
>>This was a fetch of 50K+ URLs from a diverse set of servers.

<snip>

>>I saw Ken Krugler's email suggesting to increase the number of
>>fetcher threads to 2000+ and set the maximal java thread stack size
>>to 512k with -Xss. Has anyone other than Ken tried this with
>>success? Wouldn't the JVM go crazy context switching between this
>>many threads?

Note that most of the time these fetcher threads are all blocked, waiting for other threads that are already fetching from the same IP address. So there's not a lot of thrashing.

OG: I see. But wouldn't that be true only for more vertical crawls, i.e. crawls that don't cover a very large and diverse set of hosts? In other words, if you are doing a web-wide crawl, each of those 2000 fetcher threads is very likely to be assigned to a host/IP that is not currently being crawled by any other fetcher thread, no?

>I've been working with 512k -Xss (Ken Krugler's suggestion) and it works
>well. However, the number of fetcher threads for my part is 2500+. I had to play
>around with this number to match my bandwidth limitation, but now I
>use my full bandwidth. But the problems that I run into are
>fetcher threads hanging, and the crawl delay/robots.txt file (please see
>Dennis Kubes' posting on this).

Yes, these are definitely problems. Stefan has been working on a queue-based fetcher that uses NIO. Seems very promising, but not yet ready for prime time.

OG: Yeah, I saw his email. Kelvin worked on the same thing many months ago, pre-0.8, but it never made it into the trunk.
I'm looking forward to Stefan's code now.

Otis
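
To make the blocking question from this thread concrete, here is a minimal Java sketch. It is not Nutch's fetcher code: the class name, the method `peakActiveFetchers`, the one-permit-per-host semaphore, and the 5 ms sleep standing in for a network fetch are all assumptions made for illustration. It only shows that the number of threads doing work at once is bounded by the number of distinct hosts, so with few hosts most threads sit blocked, while with as many hosts as threads nearly all of them can be active at once.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not part of Nutch.
public class PerHostBlockingSketch {

    // Runs `threads` simulated fetchers spread over `distinctHosts` hosts and
    // returns the peak number of threads that were "fetching" at the same time.
    static int peakActiveFetchers(int threads, int distinctHosts) {
        ConcurrentHashMap<Integer, Semaphore> hostLocks = new ConcurrentHashMap<>();
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(threads);

        for (int i = 0; i < threads; i++) {
            final int host = i % distinctHosts; // assumed round-robin host assignment
            new Thread(() -> {
                // One permit per host: at most one thread fetches from a host at a time.
                Semaphore lock = hostLocks.computeIfAbsent(host, h -> new Semaphore(1));
                try {
                    lock.acquire(); // thread is parked here while the host is busy
                    try {
                        int now = active.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(5); // stand-in for the actual network fetch
                    } finally {
                        active.decrementAndGet();
                        lock.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        try {
            done.await(); // wait for all simulated fetchers to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // Vertical crawl: many threads, few hosts, so most threads are blocked.
        System.out.println("few hosts:  peak active = " + peakActiveFetchers(200, 4));
        // Web-wide crawl: as many hosts as threads, so far fewer threads are blocked.
        System.out.println("many hosts: peak active = " + peakActiveFetchers(200, 200));
    }
}
```

The exact peak in the many-hosts case depends on thread scheduling, so the sketch makes no claim about it beyond the upper bound of one active thread per host.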
