[ 
https://issues.apache.org/jira/browse/HADOOP-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612593#action_12612593
 ] 

Spyros Blanas commented on HADOOP-3738:
---------------------------------------

If all ideas described in HADOOP-3136 are implemented, this patch becomes 
useless. However, HADOOP-3136 is an open-ended suggestion: 
* It suggests changing the protocol between clients and servers to add more 
information, but it doesn't go into details of what should actually be 
transmitted. (number of total free slots? number of free map and reduce slots? 
more statistics?)
* It suggests changing the task selection logic to do something "better" and 
gives some good examples, but how we can design a task assignment strategy out 
of these examples is still not clear.

My patch sidesteps the first issue by not requiring any changes to the 
protocol. It handles the second by calling the existing task assignment code 
more frequently, but only after a heartbeat in which a task was assigned.

The price is slightly higher network traffic during periods when there might 
be work to do, but heartbeat messages are "light" compared to the other network 
messages used in DFS communication or the HTTP connections during the shuffle 
stage. In the worst case, one extra heartbeat will be sent.

HADOOP-3136 suggests a more elegant approach, but one that requires 
significant changes. I've written this patch as a first attempt at something 
smarter than the current strategy (in line with the suggestions in 
HADOOP-3136), but without taking on the complex issues of changing the protocol 
and redesigning the task assignment algorithm.
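
To make the numbers in the issue description quoted below concrete, here is a 
minimal sketch of the timing. It is not the code in dynamic_heartbeat.patch: 
the class and method names are hypothetical, and it assumes an idle cluster 
with enough pending tasks, one task started per heartbeat response, and the 
first task started at time 0.

```java
// Hypothetical sketch of the timing argument; not taken from the actual patch.
final class HeartbeatSketch {
    /** Regular heartbeat interval from the description (5 sec by default). */
    static final long REGULAR_INTERVAL_MS = 5000;

    /**
     * Interval the TaskTracker is asked to wait before its next heartbeat:
     * 1/5th of the regular interval if a task was assigned in this round.
     */
    static long nextIntervalMs(boolean taskAssigned, long regularMs) {
        return taskAssigned ? regularMs / 5 : regularMs;
    }

    /**
     * Time until the last free slot on one node is filled, assuming each
     * heartbeat response starts exactly one task and work is always pending.
     */
    static long lastSpawnDelayMs(int freeSlots, long regularMs, boolean dynamic) {
        long now = 0;
        for (int started = 1; started < freeSlots; started++) {
            // The previous response assigned a task, so the dynamic scheme
            // shortens the wait before the next heartbeat.
            now += dynamic ? nextIntervalMs(true, regularMs) : regularMs;
        }
        return now;
    }

    public static void main(String[] args) {
        // 10 slots, fixed 5 sec heartbeats: 9 further heartbeats of 5000 ms.
        System.out.println(lastSpawnDelayMs(10, REGULAR_INTERVAL_MS, false)); // 45000
        // Same node with the shortened interval: 9 heartbeats of 1000 ms.
        System.out.println(lastSpawnDelayMs(10, REGULAR_INTERVAL_MS, true));  // 9000
    }
}
```

Under these assumptions the last of 10 tasks starts after 9 seconds instead 
of 45, and once a response assigns no task the interval returns to normal.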

> Spawning tasks faster
> ---------------------
>
>                 Key: HADOOP-3738
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3738
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Spyros Blanas
>            Priority: Minor
>         Attachments: dynamic_heartbeat.patch
>
>
> In the current implementation, tasks are assigned to tasktrackers by adding 
> an appropriate action to the heartbeat response list. Each heartbeat response 
> can start one task. As the minimum interval between heartbeats is 5 sec (by 
> default), if the nodes are strong machines (say, each node has 10 task 
> "slots") and the cluster is idle, this means that some tasks are spawned 
> after some time (in our example, the last task will be spawned after 45 
> seconds).
> Spawning tasks faster can significantly improve the end-to-end execution 
> time if most jobs finish in the order of minutes.
> The patch I attach asks each TaskTracker to send its next heartbeat after 
> 1/5th of the regular heartbeat interval if it was assigned a task in this 
> round, making the spawning of multiple tasks much more efficient.
> A better approach would be to have each TaskTracker report the number of free 
> slots it has (instead of only whether it can accept more work) and have the 
> JobTracker push the appropriate number of tasks in one response, but this 
> would require changes to the current communication protocol.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
