On Sep 25, 2007, at 10:09 AM, Michael Bieniosek wrote:
For our CPU-bound application, I set the value of
mapred.tasktracker.tasks.maximum (number of map tasks per
tasktracker) equal to the number of CPUs on a tasktracker.
Unfortunately, I think this value has to be set per cluster, not
per machine. This is okay for us because our machines have similar
hardware, but it might be a problem if your machines have different
numbers of CPUs.
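Concretely, that boils down to something like the following in each
tasktracker's hadoop-site.xml (a sketch based on the property name
above; exact semantics may vary by Hadoop version, and later releases
split this into separate map and reduce maximums):

    <!-- hadoop-site.xml on each tasktracker (sketch) -->
    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <!-- allow up to 4 concurrent tasks, one per CPU on a quad-core box -->
      <value>4</value>
    </property>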
I did some experimentation with the number of tasks per machine on a
set of quad-core boxes. I couldn't figure out how to change this
value without stopping and restarting the cluster, and I also
couldn't figure out how to tune it on a per-machine basis (though the
latter didn't matter much in my case).
My test had no reduce phase, so I simply set the reduce count to 1
per machine for all the tests. On the quad-core boxes, 5 map tasks
per machine actually performed best, but only marginally better than
4 map tasks (about 4% faster with just one box in the cluster, 2%
with 4 boxes). Six tasks per machine started to trend back in the
other direction.
I created HADOOP-1245 a long time ago for this problem, but I've
since heard that Hadoop uses only the cluster-wide value for maps per
tasktracker, not the hybrid model I describe there. In any case, I
never did any work on fixing it because I don't need heterogeneous
clusters.
-Michael
On 9/25/07 9:37 AM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:
On 9/25/07 9:27 AM, "Bob Futrelle" <[EMAIL PROTECTED]> wrote:
How does Hadoop handle multi-core CPUs? Does each core run a
distinct copy of the mapped app? Is this automatic, or does it need
some configuration, or what?
Works fine. You need to tell it how many maps to run per machine. I
expect that this can be tuned per machine.
Or should I just spread Hadoop over some friendly machines already
in my College, buying nothing?
Or both? You will get interesting results all three ways.