Hi, I've been digging through Google and the list archives quite thoroughly, to little avail. Please excuse any grammar mistakes; I just moved and don't have Internet access for my laptop.
The big problem I'm facing, thus far, occurs on the 4th fetch. All but 1 or 2 maps complete, and all of the running reduces stall (0.00 MB/s), presumably because they are waiting on those maps to finish? I really don't know, and it's frustrating. I've been playing heavily with the tuning formulas, but however many maps/reduces I set in mapred-site.xml, the outcome is the same. I've created dozens of Hadoop AMIs with tweaks in the following ranges:

Memory assigned: 512m-2048m
Fetcher threads: 64-1024 (King of the DoS!)
Tracker concurrent maps: 1-32
Jobtracker total maps: 11 (1/node) to ~1091
Tracker concurrent reduces: 1-32
Jobtracker total reduces: 11 (1/node) to ~1091

There are more, and I'll share some of my conf files once I'm able to; in the meantime, a rough sketch of the relevant properties is below my signature. I would sincerely appreciate some insight into how to configure the various settings in Nutch/Hadoop.

My scenario:

Number of sites: 10,000-30,000 per crawl
Depth: ~5
Content: Text is all that I care about (HTML/RSS/XML)
Nodes: Amazon EC2 (ugh)
Storage: I've performed crawls with both HDFS and Amazon S3. I thought S3 would be more performant, yet it doesn't appear to affect matters.
Cost vs. speed: I don't mind throwing EC2 instances at this to get it done quickly, but I can't imagine I need much more than 10-20 mid-size instances for this.

Can anyone share their own experiences with the performance they've seen?

Thank you very much,
Scott Gonyea
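P.S. In case it helps to have something concrete before I can post my actual conf files, here is a rough sketch of the Hadoop side of one of those AMIs. This is just one point in the ranges above (0.20-era property names; the values are examples I've tried, not a recommendation):

  <!-- mapred-site.xml (sketch, example values only) -->
  <configuration>
    <!-- "Memory assigned": heap given to each map/reduce child JVM -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>
    <!-- "Tracker concurrent maps/reduces": task slots per tasktracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>8</value>
    </property>
    <!-- "Jobtracker total maps/reduces": default task counts per job -->
    <property>
      <name>mapred.map.tasks</name>
      <value>11</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>11</value>
    </property>
  </configuration>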

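And the fetcher threads on the Nutch side (again, example values from the ranges above, not a recommendation):

  <!-- nutch-site.xml (sketch, example values only) -->
  <configuration>
    <!-- "Fetcher threads": fetch threads run by each fetch map task -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>64</value>
    </property>
    <!-- politeness cap per host, so the count above isn't a DoS -->
    <property>
      <name>fetcher.threads.per.host</name>
      <value>2</value>
    </property>
  </configuration>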
