I would still like to see some of these site preferences be more
dynamic. For instance, I will soon be using both single CPU and dual
CPU machines, with varying amounts of RAM. I'd happily have an extra
job or 2 scheduled on the dual CPU machines, to keep them utilized and
take better advantage of the RAM (which is mostly serving as disk
cache for my current loads). But there's no way to set a different
tasks.maximum for each node (or any concept of a "class of node") at
this point. If I set the value too high, tasks are more likely to fail
on the lower-class nodes; too low, and I won't use the whole cluster
effectively.

Adapting to variability of resources is still a big problem across
Hadoop. Performance still drops off very rapidly in many cases if you
have a weak node: there's no speculative reduce execution, there are
bugs in speculative map execution, and there's bad handling of
filled-up space during DFS writes as well as MapOutputFile writes. In
fact, anything that calls "getLocalPath" gets spread uniformly across
the available drives with no "full" checking, so filling up any one
drive anywhere in the cluster can cause all kinds of things to fail.

On 5/25/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
There are a few parameters that would need to be set.
mapred.tasktracker.tasks.maximum specifies the maximum number of tasks
per tasktracker.  mapred.map.tasks sets the default number of map tasks
per job.  Usually this is set to a multiple of the number of processors
you have.  So if you have 5 nodes, each with 4 cores, that's 20 cores
total, and you could set mapred.map.tasks to something like 100
(5 nodes * 4 cores = 20 cores; 20 * 5 = 100), which works out to
running 5 tasks on each processor simultaneously.
mapred.tasktracker.tasks.maximum would then be set to, say, 25 (more
than the 20 tasks per node/tasktracker), so a full node's worth of
tasks can run at once.
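
For reference, in hadoop-site.xml those two settings would look
something like the sketch below (the values are just the 5 node /
4 core example above, not general recommendations):

  <configuration>
    <property>
      <name>mapred.map.tasks</name>
      <value>100</value>
      <!-- default number of map tasks per job -->
    </property>
    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>25</value>
      <!-- max simultaneous tasks per tasktracker -->
    </property>
  </configuration>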

Those settings would configure how tasks run, but there are some other
things to consider.  First, mapred.map.tasks sets the default number of
tasks, meaning each job is broken into roughly that many tasks
(usually that many or a little more).  You may not want every job
broken up into that many pieces, because it can take longer to split a
job into, say, 100 pieces and process each piece than it would to
split it into 5 pieces and run those.  So consider whether the job is
big enough to warrant the overhead.  Also, there are settings such as
mapred.submit.replication, mapred.speculative.execution, and
mapred.reduce.parallel.copies which can be tuned to make the entire
process run faster.
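
Those knobs go in hadoop-site.xml the same way; for example (the
values here are only illustrative, not recommendations):

  <property>
    <name>mapred.speculative.execution</name>
    <value>true</value>
    <!-- allow speculative execution of slow map tasks -->
  </property>
  <property>
    <name>mapred.submit.replication</name>
    <value>10</value>
    <!-- replication level for submitted job files -->
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>5</value>
    <!-- parallel copies run by a reduce during the shuffle -->
  </property>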

Try this and see if it gives you the results you are looking for.  As
for running multiple tasktrackers per node: you can do that, but you
would have to modify the start-all.sh and stop-all.sh scripts to start
and stop the multiple trackers, and you would probably need different
install paths and configurations (hadoop-site.xml files) for each
tasktracker, since there are pid files to be concerned with.
Personally, I think that is the more difficult way to proceed.

Dennis

--
Bryan A. Pendleton
Ph: (877) geek-1-bp
