A single slow (but not dead) map TaskTracker impedes MapReduce progress
-----------------------------------------------------------------------

                 Key: HADOOP-5985
                 URL: https://issues.apache.org/jira/browse/HADOOP-5985
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Aaron Kimball


We see cases where a large number of mapper nodes run many tasks (e.g., a
thousand). The reducers pull down most of the map task intermediate files
(e.g., 980 of them), but cannot retrieve the remaining intermediate shards
from the last node. The TaskTracker on that node returns data to reducers
either slowly or not at all, but its heartbeat messages still reach the
JobTracker, so the JobTracker doesn't mark the tasks as failed.
Manually stopping the offending TaskTracker migrates the tasks to other nodes,
where the shuffling process finishes very quickly. Left on its own, the job can
take hours to unjam itself.

We need a mechanism for reducers to provide feedback to the JobTracker that one 
of the mapper nodes should be regarded as lost.
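
As a rough illustration of the kind of feedback mechanism meant here, the
sketch below counts how many distinct reducers have reported a failed fetch
from a given TaskTracker and flags that tracker once a threshold is reached.
Everything in it is hypothetical: the class FetchFailureMonitor, its methods,
and the threshold are assumptions made for this example and are not existing
Hadoop APIs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical JobTracker-side bookkeeping (not an existing Hadoop class):
 * collects fetch-failure reports from reducers, keyed by TaskTracker host.
 */
class FetchFailureMonitor {
    // Assumed tunable: distinct reducers that must complain before a tracker is flagged.
    private final int reducerThreshold;

    // TaskTracker host -> IDs of the distinct reducers that failed to fetch from it.
    private final Map<String, Set<String>> complaints = new HashMap<>();

    FetchFailureMonitor(int reducerThreshold) {
        if (reducerThreshold < 1) {
            throw new IllegalArgumentException("reducerThreshold must be >= 1");
        }
        this.reducerThreshold = reducerThreshold;
    }

    /**
     * Records one report. Returns true once enough distinct reducers have
     * complained that the tracker should be regarded as lost.
     */
    synchronized boolean reportFetchFailure(String trackerHost, String reducerId) {
        Set<String> reporters =
            complaints.computeIfAbsent(trackerHost, host -> new HashSet<>());
        reporters.add(reducerId); // repeated reports from one reducer count once
        return reporters.size() >= reducerThreshold;
    }

    /** Forgets all reports for a tracker, e.g. once its tasks were rescheduled. */
    synchronized void clear(String trackerHost) {
        complaints.remove(trackerHost);
    }
}
```

Counting distinct reducers rather than raw reports is a deliberate choice in
this sketch: a single reducer with its own network problem then cannot get a
healthy TaskTracker marked as lost by retrying repeatedly.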

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
