I simply followed the wiki's guidance that "The right level of parallelism
for maps seems to be around 10-100 maps/node":
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces

We have 8 cores in each machine, so perhaps 100 mappers ought to be right.
It's set to 157 in the config, but Hadoop used ~200 for the job; I don't
know why. Fewer tasks would of course help in this case, but what about
when we process large datasets, especially if a mapper fails?
I also set up the reducers at roughly one per core, slightly fewer.
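(A minimal sketch, not from the original mail, of how those counts get
passed in through JobConf; the class name and the exact figures are only
illustrative. As far as I know, setNumMapTasks() is only a hint, since the
InputFormat's splits decide the real number of maps, which would explain a
configured 157 turning into ~200 tasks; setNumReduceTasks() on the other
hand is taken literally.)

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(TaskCountSketch.class);

        // Hint only: the framework derives the actual map count from the
        // input splits, so 157 here can still come out as ~200 map tasks.
        conf.setNumMapTasks(157);

        // Exact: precisely this many reduce tasks are launched, e.g. a bit
        // under one per core on a 10-node cluster with 8 cores per node.
        conf.setNumReduceTasks(70);
    }
}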
/Johan
Ted Dunning wrote:
Why so many mappers and reducers relative to the number of machines you
have? This just causes excess heartache when running the job.
My standard practice is to run with a small factor more tasks than the
number of cores I have (for instance, 3 tasks on a 2-core machine). In
fact, I find it most helpful to let the cluster defaults rule the choice,
except in a few cases where I want one reducer or a few more than the
standard 4 reducers.
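(One possible reading of the advice above, sketched out; the helper name,
the idea of reading the cluster's advertised slot capacity, and the "one
wave of reduces" choice are assumptions on my part, not something Ted
stated.)

import java.io.IOException;
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterScaledTasks {
    public static void sizeFromCluster(JobConf conf) throws IOException {
        ClusterStatus cluster = new JobClient(conf).getClusterStatus();

        // Total map and reduce slots across the cluster; admins usually
        // set these to a small multiple of the cores on each machine.
        int mapSlots = cluster.getMaxMapTasks();
        int reduceSlots = cluster.getMaxReduceTasks();

        conf.setNumMapTasks(mapSlots);        // still only a hint to the framework
        conf.setNumReduceTasks(reduceSlots);  // roughly one wave of reduces
    }
}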
On 1/15/08 9:15 AM, "Johan Oskarsson" <[EMAIL PROTECTED]> wrote:
Hi.
I believe someone posted about this a while back, but it's worth
mentioning again.
I just ran a job on our 10-node cluster where the input data was
~70 empty sequence files. With our default settings this ran ~200
mappers and ~70 reducers.
The job took almost exactly two minutes to finish.
How can we reduce this overhead?
* Pick the number of mappers and reducers more dynamically,
depending on the size of the input? (a rough sketch of this idea appears below)
* JVM reuse: one JVM per job instead of one per task?
Any other ideas?
/Johan
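(Not from the thread either, but a rough sketch of the first idea above,
picking the reduce count from the input size; the bytes-per-reducer target
and the class name are made up, and it assumes a reasonably recent
FileSystem API.)

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class DynamicReducerCount {
    // Made-up target: roughly one reducer per gigabyte of input.
    private static final long BYTES_PER_REDUCER = 1L << 30;

    public static int chooseReducers(JobConf conf, Path inputDir) throws IOException {
        FileSystem fs = inputDir.getFileSystem(conf);
        long totalBytes = fs.getContentSummary(inputDir).getLength();

        // Scale the reduce count with the input size; for ~70 empty
        // sequence files this collapses to a single reducer instead of ~70.
        int reducers = (int) Math.max(1L, totalBytes / BYTES_PER_REDUCER);
        conf.setNumReduceTasks(reducers);
        return reducers;
    }
}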