I would still like to see some of these site preferences be more dynamic. For instance, I will soon be using both single-CPU and dual-CPU machines with varying amounts of RAM. I'd happily have an extra job or two scheduled on the dual-CPU machines, to keep them utilized and take better advantage of the RAM (which mostly serves as disk cache under my current loads). But there's no way to set a different tasks.maximum for each node (or a concept of "class of node") at this point. If I set the value too high, tasks are more likely to fail on the lower-class nodes; too low, and I won't use the whole cluster effectively.
Adapting to variability of resources is still a big problem across Hadoop. Performance still drops off very rapidly in many cases if you have a weak node: there's no speculative reduce execution, there are bugs in speculative map execution, and there's bad handling of filled-up space during DFS writes as well as MapOutputFile writes. In fact, anything that calls "getLocalPath" gets spread uniformly across the available drives with no "full" checking, so filling up any one drive on the entire cluster can cause all kinds of things to fail.

On 5/25/06, Dennis Kubes <[EMAIL PROTECTED]> wrote:
There are a few parameters that would need to be set. mapred.tasktracker.tasks.maximum specifies the maximum number of tasks per tasktracker. mapred.map.tasks sets the default number of map tasks per job. Usually this is set to a multiple of the number of processors you have. So if you have 5 nodes each with 4 cores, you could set mapred.map.tasks to something like 100 (5 nodes * 4 cores = 20 cores, and 20 cores * 5 tasks per core = 100), where we would run 5 tasks on each processor simultaneously. mapred.tasktracker.tasks.maximum would then be set to, say, 25 (more than the roughly 20 tasks per node / tasktracker).

Those settings would configure how tasks run, but there are some other things to consider. First, mapred.map.tasks sets the default number of tasks, meaning each job is broken into about that many tasks (usually that or a little more). You may not want some jobs broken up into that many pieces, because it can take longer to split a job into, say, 100 pieces and process each piece than it would to split it into 5 pieces and run those. So consider whether the job is big enough to warrant the overhead. There are also settings such as mapred.submit.replication, mapred.speculative.execution, and mapred.reduce.parallel.copies which can be tuned to make the entire process run faster. Try this and see if it gives you the results you are looking for.

To address running multiple tasktrackers per node: you can do that, but you would have to modify the start-all.sh and stop-all.sh scripts to be able to start and stop the multiple trackers, and you would probably need different install paths and configurations (hadoop-site.xml files) for each tasktracker, as there are pid files to be concerned with. Personally I think that is a more difficult way to proceed.

Dennis
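(For reference, a rough sketch of how those settings might look in hadoop-site.xml for the 5-node / 4-core example above. The property names are the ones Dennis mentions; the values, and in particular the replication, speculative-execution, and parallel-copies numbers, are only illustrative placeholders, not recommendations.)

<configuration>
  <!-- aim for ~5 map tasks per core: 5 nodes * 4 cores = 20 cores, 20 * 5 = 100 -->
  <property>
    <name>mapred.map.tasks</name>
    <value>100</value>
  </property>
  <!-- per-tasktracker cap, a bit above the ~20 tasks each node would otherwise get -->
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>25</value>
  </property>
  <!-- additional knobs Dennis mentions; values here are assumptions for illustration -->
  <property>
    <name>mapred.speculative.execution</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.submit.replication</name>
    <value>10</value>
  </property>
</configuration>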
--
Bryan A. Pendleton
Ph: (877) geek-1-bp
