[ https://issues.apache.org/jira/browse/HADOOP-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716815#action_12716815 ]
Hong Tang commented on HADOOP-5985:
-----------------------------------

This is a very good observation, and the solution seems interesting too (we observed this behavior in our PetaSort experiment as well). However, I think the fundamental problem is still not tackled: the last few mappers will block all reducers. Even if every mapper is running on a different TaskTracker, all reducers would still be trying to pull from those few mappers, so the shuffle would still be very slow. Informing the JT to spawn mappers on other TTs would not help (and may make matters even worse, because you may end up not making any progress at all).

To really solve the problem, we probably want to run multiple copies of the same mapper and keep them all, then balance the reducers among those replica instances. This is not an easy fix and may belong to the scope of speculative execution.

> A single slow (but not dead) map TaskTracker impedes MapReduce progress
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-5985
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5985
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.18.3
>            Reporter: Aaron Kimball
>
> We see cases where there may be a large number of mapper nodes running many
> tasks (e.g., a thousand). The reducers will pull 980 of the map task
> intermediate files down, but will be unable to retrieve the final
> intermediate shards from the last node. The TaskTracker on that node returns
> data to reducers either slowly or not at all, but its heartbeat messages make
> it back to the JobTracker -- so the JobTracker doesn't mark the tasks as
> failed. Manually stopping the offending TaskTracker works to migrate the
> tasks to other nodes, where the shuffling process finishes very quickly. Left
> on its own, it can take hours to unjam itself otherwise.
> We need a mechanism for reducers to provide feedback to the JobTracker that
> one of the mapper nodes should be regarded as lost.
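
For illustration only, the reducer-feedback mechanism requested in the issue description might be sketched roughly as below. This is not Hadoop code: the class FetchFailureTracker, its methods, the lost_fraction threshold, and the tracker names are all hypothetical, and the rule "treat a TaskTracker as lost once a given fraction of reducers has failed to fetch from it" is an assumption made for the example, not something the issue specifies.

```python
from collections import defaultdict


class FetchFailureTracker:
    """Hypothetical JobTracker-side bookkeeping; not an actual Hadoop class."""

    def __init__(self, num_reducers, lost_fraction=0.5):
        self.num_reducers = num_reducers
        # Assumed policy: fraction of reducers that must complain
        # before a TaskTracker is regarded as lost.
        self.lost_fraction = lost_fraction
        # TaskTracker name -> set of reducer ids that failed to fetch from it.
        self._complaints = defaultdict(set)

    def report_fetch_failure(self, reducer_id, tracker):
        """Record a complaint; return True if the tracker now counts as lost."""
        # A set is used so repeated reports from one reducer count only once.
        self._complaints[tracker].add(reducer_id)
        return self.is_lost(tracker)

    def is_lost(self, tracker):
        complaining = len(self._complaints.get(tracker, ()))
        return complaining >= self.lost_fraction * self.num_reducers


# Four reducers; a tracker is regarded as lost once two of them complain.
tracker = FetchFailureTracker(num_reducers=4, lost_fraction=0.5)
first = tracker.report_fetch_failure(reducer_id=0, tracker="tt-slow")
repeat = tracker.report_fetch_failure(reducer_id=0, tracker="tt-slow")
second = tracker.report_fetch_failure(reducer_id=1, tracker="tt-slow")
```

Counting distinct reducers rather than raw reports keeps one struggling reducer from condemning a healthy node. As the comment above points out, rescheduling the affected maps this way would still leave every reducer pulling from the few re-run mappers, so a sketch like this covers only the detection half of the problem.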