There are a few parameters that need to be set.
mapred.tasktracker.tasks.maximum specifies the maximum number of tasks
per tasktracker. mapred.map.tasks sets the default number of map tasks
per job; usually this is set to a multiple of the total number of
processors you have. So if you have 5 nodes, each with 4 cores, you
could set mapred.map.tasks to something like 100 (5 nodes * 4 cores =
20 cores, and 20 cores * 5 tasks per core = 100), where we would run 5
tasks on each processor simultaneously.
mapred.tasktracker.tasks.maximum would then be set to, say, 25 (a bit
more than the 20 tasks each node/tasktracker would get).
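For reference, here is roughly what those two settings would look like
in hadoop-site.xml. This is just a sketch using the example numbers
above, not recommended values:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>Default number of map tasks per job
  (5 nodes * 4 cores * 5 tasks per core).</description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>25</value>
  <description>Maximum number of simultaneous tasks per
  tasktracker, a bit above the expected 20 per node.</description>
</property>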
Those settings control how many tasks run, but there are some other
things to consider. First, mapred.map.tasks sets the default number of
tasks, meaning each job is broken into roughly that many tasks
(usually that or a little more). You may not want some jobs broken
into that many pieces, because it can take longer to split a job into,
say, 100 pieces and process each piece than it would to split it into
5 pieces and run those. So consider whether the job is big enough to
warrant the overhead. There are also settings such as
mapred.submit.replication, mapred.speculative.execution, and
mapred.reduce.parallel.copies which can be tuned to make the entire
process run faster.
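Again as a rough sketch, those would go in hadoop-site.xml something
like the following. The values here are only illustrative guesses;
check hadoop-default.xml in your release for the shipped defaults
before changing anything:

<property>
  <name>mapred.submit.replication</name>
  <value>10</value>
  <description>Replication level for the job files pushed to DFS at
  submit time, so the tasktrackers are not all pulling them from the
  same few datanodes.</description>
</property>

<property>
  <name>mapred.speculative.execution</name>
  <value>true</value>
  <description>Run backup copies of slow tasks so a straggler does
  not hold up the whole job.</description>
</property>

<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
  <description>Number of map outputs a reduce fetches in parallel
  during the copy phase.</description>
</property>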
Try this and see if it gives you the results you are looking for. As
for running multiple tasktrackers per node: you can do that, but you
would have to modify the start-all.sh and stop-all.sh scripts to be
able to start and stop the multiple trackers, and you would probably
need a different install path and configuration (hadoop-site.xml file)
for each tasktracker, since there are pid files to be concerned with.
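If you do try it anyway, each tasktracker instance would at least need
its own local directory and its own ports in its copy of
hadoop-site.xml, something like the sketch below for a second
instance. The port parameter names changed between early releases, so
treat them as assumptions and verify against your hadoop-default.xml:

<property>
  <name>mapred.local.dir</name>
  <!-- hypothetical path; must not collide with the first instance -->
  <value>/tmp/hadoop-tt2/mapred/local</value>
</property>

<property>
  <!-- assumed parameter name; verify in your release -->
  <name>mapred.task.tracker.output.port</name>
  <value>50042</value>
</property>

<property>
  <!-- assumed parameter name; verify in your release -->
  <name>mapred.task.tracker.report.port</name>
  <value>50052</value>
</property>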
Personally I think that is a more difficult way to proceed.
Dennis
Gianlorenzo Thione wrote:
Thanks for the answer. So far I am still trying to understand how each
tasktracker gets multiple map or reduce tasks to execute
simultaneously. I ran a simple job with 53 map tasks on 5 nodes, and
at all times each node was executing a single task. Each cluster node
is a 4-core machine, so theoretically this was a 20-core cluster, and
I feel the resources were underutilized. Am I missing something? Is
there a parameter for a minimum number of tasks to be executed in
parallel (I found a parameter for setting a maximum, which I set to
4)? If I run 4 TaskTrackers per node, then each tracker gets a map
task at the same time and overall execution seems much faster.
I'd appreciate help and insights with respect to this matter.
Eventually, each map task in our application will synchronize with an
external single-threaded, CPU-intensive process to process data (thus
using the tasktracker as a driver for these processes). We need to
make sure that each node is utilized at its maximum capacity at all
times by running 4 instances of those single-threaded processes, and
to achieve that we'd need each TaskTracker to be handed on average 4
map tasks at a time, each run concurrently in a different thread. Is
there a way to guarantee that this happens? Alternatively, we can
always run 4 TaskTrackers per node, which was our original plan, but
if there is a better/smarter way to do this, that would be the best
solution.
Thanks in advance!
Lorenzo Thione
On May 24, 2006, at 7:31 AM, Dennis Kubes wrote:
Using Java 5 will allow the threads of various tasks to take
advantage of multiple processors. Just make sure you set your map
tasks property to a multiple of the total number of processors. We
are running multi-core machines and are seeing good utilization
across all cores this way.
Dennis
Gianlorenzo Thione wrote:
Hello everybody,
I'll ask my first question on this forum and hopefully start
building more and more understanding of hadoop so that we can
eventually contribute actively. In the meantime, I have a simple
issue/question/suggestion...
I have many multi-core, multi-processor nodes in my cluster, and I'd
like to be able to run several tasktrackers and datanodes per
physical machine. I am modifying the startup scripts so that a
number of worker JVMs can be started on each node, capped at the
number of CPUs seen by the kernel.
Since our map jobs are highly CPU-intensive, it makes sense to run
parallel jobs on each node, maximizing CPU utilization.
Is that something that would make sense to roll back into the
scripts for hadoop as well? Is anybody else running on
multi-processor architectures?
Lorenzo Thione
Powerset, Inc.