[ https://issues.apache.org/jira/browse/HADOOP-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717459#action_12717459 ]
Hong Tang commented on HADOOP-5985:
-----------------------------------
AFAIK, once a reducer finishes pulling map output from a particular mapper, it
will not pull the same output again (from a different invocation of the same
map task). If that were not so, a reducer could not quit until all other
reducers had finished fetching map outputs from all maps. This in turn would
imply that if your cluster has fewer reducer slots than the total number of
reducers, your job would never finish.
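
The slot argument above can be illustrated with a toy simulation. This is only
a sketch under simplifying assumptions, not Hadoop code: the function name
`simulate`, the round-based scheduling, and the flag
`hold_slot_until_all_fetched` are all hypothetical and exist only to model the
"a reducer cannot quit until all other reducers finish fetching" rule.

```python
def simulate(num_reducers: int, reducer_slots: int,
             hold_slot_until_all_fetched: bool, max_rounds: int = 1000) -> bool:
    """Toy model (hypothetical, not Hadoop's scheduler).

    Returns True if every reducer eventually completes, False if the job stalls.
    """
    pending = list(range(num_reducers))  # reducers not yet given a slot
    running = set()                      # reducers currently occupying a slot
    fetched = set()                      # reducers that have fetched all map output
    done = set()                         # reducers that have exited

    for _ in range(max_rounds):
        # Assign free slots to waiting reducers.
        while pending and len(running) < reducer_slots:
            running.add(pending.pop(0))

        # Assumption: every running reducer finishes its fetch within one round.
        fetched |= running

        if hold_slot_until_all_fetched:
            # Hypothetical rule: nobody may exit until ALL reducers have fetched.
            can_exit = set(running) if len(fetched) == num_reducers else set()
        else:
            # A reducer exits as soon as its own fetch is complete.
            can_exit = set(running)

        done |= can_exit
        running -= can_exit

        if len(done) == num_reducers:
            return True
        if not can_exit:
            # Slots are full and nobody can leave: the state can never change.
            return False
    return False


if __name__ == "__main__":
    print(simulate(5, 3, hold_slot_until_all_fetched=True))
    print(simulate(5, 3, hold_slot_until_all_fetched=False))
```

With 5 reducers and 3 slots, the "hold" rule stalls because the two unscheduled
reducers can never fetch, while the independent-exit rule lets the job drain in
two rounds.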
> A single slow (but not dead) map TaskTracker impedes MapReduce progress
> -----------------------------------------------------------------------
>
> Key: HADOOP-5985
> URL: https://issues.apache.org/jira/browse/HADOOP-5985
> Project: Hadoop Core
> Issue Type: Bug
> Affects Versions: 0.18.3
> Reporter: Aaron Kimball
>
> We see cases where there may be a large number of mapper nodes running many
> tasks (e.g., a thousand). The reducers will pull 980 of the map task
> intermediate files down, but will be unable to retrieve the final
> intermediate shards from the last node. The TaskTracker on that node returns
> data to reducers either slowly or not at all, but its heartbeat messages make
> it back to the JobTracker -- so the JobTracker doesn't mark the tasks as
> failed. Manually stopping the offending TaskTracker works to migrate the
> tasks to other nodes, where the shuffling process finishes very quickly. Left
> on its own, the job can take hours to unjam itself.
> We need a mechanism for reducers to provide feedback to the JobTracker that
> one of the mapper nodes should be regarded as lost.
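
One possible shape for the feedback mechanism requested above is sketched here.
It is a hypothetical illustration, not Hadoop's actual API: the class
`FetchFailureTracker`, the method `report_failure`, and the `min_reporters`
threshold are assumed names and values. The idea is that the JobTracker side
counts how many distinct reducers have reported fetch failures against a given
TaskTracker host and treats the host as lost once enough of them agree.

```python
from collections import defaultdict


class FetchFailureTracker:
    """Hypothetical JobTracker-side bookkeeping for reducer fetch-failure reports."""

    def __init__(self, min_reporters: int = 3):
        # Assumed policy: this many distinct reducers must complain about a host.
        self.min_reporters = min_reporters
        # TaskTracker host -> set of reducer ids that reported a failed fetch.
        self._reporters = defaultdict(set)
        # Hosts currently regarded as lost (their map tasks would be re-run).
        self.lost = set()

    def report_failure(self, reducer_id: str, tracker_host: str) -> bool:
        """Record one report; return True if the host is now regarded as lost."""
        # A set ignores repeated reports from the same reducer, so one
        # misbehaving reducer cannot get a healthy host marked as lost.
        self._reporters[tracker_host].add(reducer_id)
        if len(self._reporters[tracker_host]) >= self.min_reporters:
            self.lost.add(tracker_host)
        return tracker_host in self.lost


if __name__ == "__main__":
    tracker = FetchFailureTracker(min_reporters=2)
    tracker.report_failure("reduce_0001", "node-17")
    tracker.report_failure("reduce_0002", "node-17")
    print(sorted(tracker.lost))
```

Requiring several distinct reporters is a design choice in this sketch: it
guards against a single reducer with its own network problem condemning a
healthy node, at the cost of reacting more slowly to a genuinely slow one.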