Hi, I've been digging through Google and the archives quite thoroughly, to 
little avail. Please excuse any grammar mistakes; I just moved and don't have 
Internet access for my laptop.

The big problem that I am facing, thus far, occurs on the 4th fetch: all but 1 
or 2 maps complete, and every running reduce stalls (0.00 MB/s), presumably 
because they are waiting on those last maps to finish? I really don't know, 
and it's frustrating.
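A guess at what's happening, for anyone comparing notes: in Hadoop no reduce 
can finish its shuffle until every map has completed, and in Nutch a 
straggling fetch map is often one partition stuck politely crawling a single 
slow host. Here's a minimal nutch-site.xml sketch of the knobs I believe 
govern that (the values are illustrative guesses, not a recommendation):

  <!-- nutch-site.xml: illustrative values only -->
  <property>
    <name>generate.max.per.host</name>
    <!-- cap URLs per host per segment so one slow site can't own a fetch map -->
    <value>100</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds between successive requests to the same host -->
    <value>1.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>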

I've been playing heavily with the task-count formula, but no matter how many 
maps/reduces I set in mapred-site.xml, the outcome is the same.

I've created dozens of Hadoop AMIs with tweaks in the following ranges (the 
matching property names are sketched just after this list):
Memory assigned: 512m-2048m
Fetcher threads: 64-1024 (King of the DoS!)
Tracker concurrent maps: 1-32
JobTracker total maps: 11 (1/node) to ~1091
Tracker concurrent reduces: 1-32
JobTracker total reduces: 11 (1/node) to ~1091
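
For reference, my best guess at the Hadoop 0.20-era property names those 
ranges map onto; the values below are placeholders pulled from the middle of 
the ranges, not tuned numbers:

  <!-- mapred-site.xml: placeholder values only -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>   <!-- memory assigned per task JVM -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>           <!-- tracker concurrent maps -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>100</value>         <!-- jobtracker total maps (a hint, not a hard limit) -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>           <!-- tracker concurrent reduces -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>100</value>         <!-- jobtracker total reduces -->
  </property>

(Fetcher threads is fetcher.threads.fetch, over in nutch-site.xml.)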

There are more, and I'll share some of my conf files once I'm able to do so. I 
would sincerely appreciate some insight into how to configure the various 
settings in Nutch/Hadoop.

My scenario:
# Sites: 10,000-30,000 per crawl
Depth: ~5
Content: Text is all that I care for. (HTML/RSS/XML)
Nodes: Amazon EC2 (ugh)
Storage: I've performed crawls with both HDFS and Amazon S3 (my S3 setup is 
sketched after this list). I expected S3 to be more performant, yet it doesn't 
appear to make a difference.
Cost vs Speed: I don't mind throwing EC2 instances at this to get it done 
quickly... But I can't imagine I need much more than 10-20 mid-size instances 
for this.
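
For the S3 runs, this is roughly how I point Hadoop at S3 in core-site.xml 
(the bucket name and keys are placeholders; s3:// is Hadoop's block-based 
store, s3n:// the native one):

  <!-- core-site.xml: bucket and keys are placeholders -->
  <property>
    <name>fs.default.name</name>
    <value>s3://my-crawl-bucket</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>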

Can anyone share their own experiences and the performance they've seen?

Thank you very much,
Scott Gonyea
