[ https://issues.apache.org/jira/browse/MAPREDUCE-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071958#comment-13071958 ]
Jonathan Eagles commented on MAPREDUCE-1060:
--------------------------------------------
Analysis of this issue: while speculative execution is a factor in
increasing the likelihood of this issue, it is not the full root cause.
Instead, any "lost tracker" event fails all tasks on the lost tracker,
causing a restart of all tasks from incomplete jobs that ran on that tracker.
For jobs with both map and reduce tasks, restarting the map tasks may not be
necessary, depending on whether all the reduce tasks have already copied
their data from the lost tracker. This is further complicated for jobs with
speculative reduce tasks or reduce task failures, since it cannot be known at
the time of the "lost tracker" event whether all the reduce tasks have copied
their data yet.
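
To make the uncertainty concrete, here is a toy Python model of the question
"does this map still need to be rerun?". It is only an illustration under my
own assumptions, not JobTracker code: ReduceAttempt, map_rerun_needed, and the
fetched/finished fields are hypothetical names. The point is that the answer
is only valid for a snapshot; a later reduce failure or a fresh speculative
attempt can flip it.

```python
from dataclasses import dataclass, field


@dataclass
class ReduceAttempt:
    # Hypothetical record of one reduce attempt (not a Hadoop class).
    fetched: set = field(default_factory=set)  # map IDs already copied
    finished: bool = False                     # attempt completed successfully


def map_rerun_needed(map_id, reduce_attempts_by_task):
    """Return True if some reduce task has no attempt that is finished
    or that already holds the output of map_id."""
    for attempts in reduce_attempts_by_task.values():
        if not any(a.finished or map_id in a.fetched for a in attempts):
            return True
    return False


# Snapshot at the time of the "lost tracker" event: every reduce is covered.
reduces = {
    "r0": [ReduceAttempt(fetched={"m0"})],
    "r1": [ReduceAttempt(finished=True)],
}
needed_at_loss = map_rerun_needed("m0", reduces)

# Later, the only attempt of r0 fails and a new one starts from scratch.
reduces["r0"] = [ReduceAttempt()]
needed_after_failure = map_rerun_needed("m0", reduces)
```

In this sketch needed_at_loss is False while needed_after_failure is True,
which is why the decision cannot safely be made once, at the time of the
event.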
This patch addresses the symptom of map tasks running long after the reduce
tasks have finished. It allows the map task to restart, and later kills the
restarted map task if it turns out not to be needed, recovering the
previously successful map task attempt from the "lost tracker" in its place.
This allows jobs to run no longer than they have to in the "lost tracker"
scenario.
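
The behaviour described above can be sketched as a small state model. This is
a simplified illustration in Python, not the code in the attached patch: the
MapTask class, its method names, and the attempt ID strings are all
hypothetical, and the real bookkeeping in the JobTracker is more involved.

```python
class MapTask:
    """Toy stand-in for a map task's bookkeeping; not a Hadoop class."""

    def __init__(self, task_id, successful_attempt):
        self.task_id = task_id
        self.successful_attempt = successful_attempt  # completed earlier
        self.shelved_attempt = None   # set aside after a "lost tracker" event
        self.running_attempt = None

    def on_tracker_lost(self, new_attempt):
        # Restart the map, but remember the attempt that had succeeded.
        self.shelved_attempt = self.successful_attempt
        self.successful_attempt = None
        self.running_attempt = new_attempt

    def on_all_reduces_done(self):
        # The rerun turned out to be unnecessary: kill it and recover the
        # previously successful attempt. Returns the killed attempt, if any.
        killed = None
        if self.running_attempt is not None and self.shelved_attempt is not None:
            killed = self.running_attempt
            self.running_attempt = None
            self.successful_attempt = self.shelved_attempt
            self.shelved_attempt = None
        return killed

    def is_complete(self):
        return self.successful_attempt is not None


task = MapTask("m0", "attempt_0")
task.on_tracker_lost("attempt_1")           # map restarts; job not yet complete
complete_during_rerun = task.is_complete()
killed = task.on_all_reduces_done()         # reducers finished without the rerun
```

Under these assumptions the task is incomplete while the rerun is in flight,
and once all reducers finish the rerun is killed and the original attempt
counts as the result again, so the job does not wait on the restarted map.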
> JT should kill running maps when all the reducers have completed
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-1060
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1060
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Jothi Padmanabhan
> Assignee: Jonathan Eagles
> Fix For: 0.20.205.0
>
> Attachments: MAPREDUCE-1060-branch-0.20-security.patch
>
>
> We have seen some situations where maps are still running when all the
> reducers have completed. This could happen because of lost TT's, interplay of
> speculative tasks with bad TT's etc. If the maps take a long time to run, it
> unnecessarily delays the job completion time, as this map output is not
> required anyways. The JT should possibly kill running maps when all the
> reducers have completed.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira