[ 
https://issues.apache.org/jira/browse/HADOOP-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717332#action_12717332
 ] 

Aaron Kimball commented on HADOOP-5985:
---------------------------------------

Hong,

This changes the semantics of speculative execution, as I understand it.

Speculative execution does not strictly guarantee that all mappers will emit 
the same output for the same input, but it does guarantee that they are all 
equally "good." So map(A) might return X, but a second speculative execution of 
map(A) might return Y. Either X or Y will finish first and the JT will use 
exactly one of these.

Your proposal is that some of the reducers can grab their output shard from the 
X results, and other reducers can grab their output shard from the Y results.
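
To make the concern concrete, here is a toy sketch (not Hadoop code). It 
assumes a hypothetical nondeterministic mapper that emits a single sampled 
record; map_attempt, shuffle, chosen_index and the two-reducer setup are all 
made up for illustration. Each attempt is equally "good" on its own, but 
mixing shards across attempts gives a result neither attempt would produce.

```python
NUM_REDUCERS = 2


def map_attempt(records, chosen_index):
    """Stand-in for a nondeterministic mapper that emits one sampled record.

    chosen_index models the random choice made by this particular attempt.
    Returns one output shard (a list) per reducer.
    """
    key = records[chosen_index]
    shards = [[] for _ in range(NUM_REDUCERS)]
    shards[key % NUM_REDUCERS].append(key)  # simple modulo partitioner
    return shards


def shuffle(source_for_reducer, attempts):
    """Reducer r fetches shard r from the attempt named in source_for_reducer[r]."""
    return [attempts[source_for_reducer[r]][r] for r in range(NUM_REDUCERS)]


records = [10, 11]
attempts = {
    "X": map_attempt(records, 0),  # picks 10, which lands in shard 0
    "Y": map_attempt(records, 1),  # picks 11, which lands in shard 1
}

all_from_x = shuffle(["X", "X"], attempts)   # today's contract: one winner
all_from_y = shuffle(["Y", "Y"], attempts)   # today's contract: one winner
mixed_empty = shuffle(["Y", "X"], attempts)  # reducer 0 reads Y, reducer 1 reads X
mixed_both = shuffle(["X", "Y"], attempts)   # reducer 0 reads X, reducer 1 reads Y
```

With a single winning attempt the job sees exactly one sampled record. With 
mixed sources it can see zero or two, which is the change in contract that 
developers would have to be told about.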

Even if we're willing to tell developers about that new contract and make it 
an option, it wouldn't be universally applicable. So I think we'd still need a 
fallback mechanism like the one I proposed here.
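
A rough sketch of what such a fallback could look like, going only by the 
issue description (reducers give the JobTracker feedback that a mapper node 
should be regarded as lost). The class name, method name and threshold below 
are hypothetical and are not Hadoop's actual API; requiring complaints from 
several distinct reducers is an assumption, meant to keep one misbehaving 
reducer from condemning a healthy TaskTracker.

```python
from collections import defaultdict


class FetchFailureTracker:
    """Hypothetical JobTracker-side bookkeeping for reducer complaints."""

    def __init__(self, min_reporting_reducers=3):
        # Assumed threshold: how many distinct reducers must complain
        # before the TaskTracker is regarded as lost.
        self.min_reporting_reducers = min_reporting_reducers
        self.reports = defaultdict(set)  # tracker name -> ids of complaining reducers
        self.lost = set()

    def report_fetch_failure(self, reducer_id, tracker):
        """Record a complaint; return True if the tracker is now regarded as lost."""
        self.reports[tracker].add(reducer_id)  # repeat complaints count once
        if len(self.reports[tracker]) >= self.min_reporting_reducers:
            self.lost.add(tracker)
        return tracker in self.lost


jt = FetchFailureTracker(min_reporting_reducers=3)
results = [jt.report_fetch_failure(r, "tt-slow") for r in ("r0", "r1", "r1", "r2")]
```

Once a tracker lands in the lost set, its map tasks would be rescheduled on 
other nodes, which is what manually stopping the TaskTracker achieves today.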

> A single slow (but not dead) map TaskTracker impedes MapReduce progress
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-5985
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5985
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.18.3
>            Reporter: Aaron Kimball
>
> We see cases where there may be a large number of mapper nodes running many 
> tasks (e.g., a thousand). The reducers will pull 980 of the map task 
> intermediate files down, but will be unable to retrieve the final 
> intermediate shards from the last node. The TaskTracker on that node returns 
> data to reducers either slowly or not at all, but its heartbeat messages make 
> it back to the JobTracker -- so the JobTracker doesn't mark the tasks as 
> failed. Manually stopping the offending TaskTracker works to migrate the 
> tasks to other nodes, where the shuffling process finishes very quickly. Left 
> on its own, it can take hours to unjam itself.
> We need a mechanism for reducers to provide feedback to the JobTracker that 
> one of the mapper nodes should be regarded as lost.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
