Hello fellow Nutchers,
I'm now trying out a "real" crawl, versus the test crawl that I
mentioned in my previous email.
One thing I notice is that my slaves aren't working very hard - I'm
obviously not using the appropriate whips :)
The two slaves are quad-processor Xeon boxes (2.8 & 3.0GHz). The load as
reported by Ganglia is typically about 0.5 (out of 4.0), though
occasionally this spikes to 1.0.
The master (also a quad 3.0GHz) is even more of a slacker,
occasionally spiking to 1.0 out of 4.0, but most of the time doing
nothing as it waits for the slaves to complete their jobs.
I figured as much for the master, but what can I do to get more from my slaves?
Right now I'm using the default settings from the 1/12/2006 build of
Nutch. The interesting ones are:
* mapred.tasktracker.tasks.maximum = 2
* fetcher.threads.fetch = 10
Plus some settings gleaned (I think) from Doug's example:
* mapred.map.tasks = 1000
* mapred.reduce.tasks = 39
* mapred.child.heap.size = 500m
I assume that mapred.reduce.tasks should be 3, not 39, since I've
only got 2 slaves, right?
Should I be boosting mapred.tasktracker.tasks.maximum to 4?
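To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes that each slave runs one tasktracker, that a tasktracker runs at most mapred.tasktracker.tasks.maximum tasks at once, and that the Ganglia load divided by the CPU count is a fair proxy for CPU use. The helper names are made up for illustration and are not part of Nutch.

```python
# Rough capacity numbers for the cluster described in this email.
# Assumptions (not confirmed against the Nutch source):
#   - one tasktracker per slave
#   - a tasktracker runs at most mapred.tasktracker.tasks.maximum tasks at once
#   - load average / CPU count approximates CPU utilization

SLAVES = 2
CPUS_PER_SLAVE = 4


def concurrent_task_slots(slaves, tasks_maximum):
    """Upper bound on tasks running at the same time across all slaves."""
    return slaves * tasks_maximum


def cpu_utilization(load, cpus):
    """Fraction of available CPUs kept busy at the given load average."""
    return load / cpus


current_slots = concurrent_task_slots(SLAVES, 2)   # current setting
proposed_slots = concurrent_task_slots(SLAVES, 4)  # one task per CPU
typical_util = cpu_utilization(0.5, CPUS_PER_SLAVE)
peak_util = cpu_utilization(1.0, CPUS_PER_SLAVE)

print(f"task slots: {current_slots} now, {proposed_slots} with tasks.maximum = 4")
print(f"slave CPU use: {typical_util:.1%} typical, {peak_util:.1%} at spikes")
```

If those assumptions hold, the slaves sit at roughly an eighth of their CPU capacity most of the time, and going from 2 to 4 tasks per tasktracker would double the number of tasks that can run at once.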
Any other ideas? I'm trying to prepare for another run once this one
has had a chance to generate some interesting results.
Thanks,
-- Ken
--
Ken Krugler