A single slow (but not dead) map TaskTracker impedes MapReduce progress
-----------------------------------------------------------------------
Key: HADOOP-5985
URL: https://issues.apache.org/jira/browse/HADOOP-5985
Project: Hadoop Core
Issue Type: Bug
Reporter: Aaron Kimball
We see cases where a job has a large number of map tasks (e.g., a thousand)
spread across many mapper nodes. The reducers will pull 980 of the map task
intermediate files down, but will be unable to retrieve the remaining
intermediate shards from the last node. The TaskTracker on that node returns
data to reducers either slowly or not at all, but its heartbeat messages still
reach the JobTracker, so the JobTracker doesn't mark the tasks as failed.
Manually stopping the offending TaskTracker causes its tasks to migrate to
other nodes, where the shuffle then finishes very quickly. Left on its own, the
job can take hours to unjam itself.
We need a mechanism for reducers to provide feedback to the JobTracker that one
of the mapper nodes should be regarded as lost.
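
One possible shape for such a mechanism, as a minimal sketch only: the
JobTracker counts how many distinct reducers report a fetch failure against a
given TaskTracker and regards the node as lost once that count reaches a
threshold. The class name FetchFailureTracker, the method
report_fetch_failure, and the threshold of 3 are all hypothetical; none of
them come from Hadoop's API.

```python
from collections import defaultdict


class FetchFailureTracker:
    """Hypothetical JobTracker-side bookkeeping; not part of Hadoop's API."""

    def __init__(self, min_reporters=3):
        # Assumed policy: this many distinct reducers must complain about a
        # TaskTracker before it is regarded as lost.
        self.min_reporters = min_reporters
        self._reporters = defaultdict(set)  # tasktracker -> reducer ids
        self.lost = set()

    def report_fetch_failure(self, reducer_id, tasktracker):
        """Record that reducer_id could not fetch map output from tasktracker.

        Returns True once the TaskTracker is regarded as lost.
        """
        # A set ignores repeated complaints from the same reducer.
        self._reporters[tasktracker].add(reducer_id)
        if len(self._reporters[tasktracker]) >= self.min_reporters:
            self.lost.add(tasktracker)
        return tasktracker in self.lost


tracker = FetchFailureTracker(min_reporters=3)
tracker.report_fetch_failure("reduce_0", "tt-slow")
tracker.report_fetch_failure("reduce_0", "tt-slow")  # duplicate, counted once
tracker.report_fetch_failure("reduce_1", "tt-slow")
is_lost = tracker.report_fetch_failure("reduce_2", "tt-slow")
```

Counting distinct reducers rather than raw reports is a design choice in this
sketch: it keeps one reducer with its own network trouble from getting a
healthy TaskTracker marked as lost.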