[ http://issues.apache.org/jira/browse/HADOOP-343?page=comments#action_12428260 ]

Sameer Paranjpye commented on HADOOP-343:
-----------------------------------------

I think this needs further attention. The patch is probably out of date at this point, but the problem is real. I think this may also be responsible for the 'long pause' at the end of the shuffle.

If a tasktracker fails, its map outputs are lost. However, the other tasktrackers are unaware of this. Towards the end of the shuffle they have all map output locations cached, and they keep trying to pull data from the lost tasktracker, one file at a time. Every one of these file transfers fails, and each failed transfer also causes the tasktrackers to back off from pulling their remaining outputs. The cumulative effect of all the backoffs is the long pause.

> In case of dead task tracker, the copy mapouts try copying all mapoutputs from this tasktracker
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-343
>                 URL: http://issues.apache.org/jira/browse/HADOOP-343
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.5.0
>            Reporter: Mahadev konar
>         Assigned To: Mahadev konar
>         Attachments: bugfix.patch
>
>
> In case of a dead task tracker, the reduces which do not have the updated map
> output locations try copying files from this node, and since there are failures
> on copying, this leads to backoff and slowing down of the copy phase.
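To make the cumulative effect of the backoffs concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under stated assumptions: the function name `shuffle_pause`, the fixed per-failure backoff, and the `skip_host_after_failure` switch are all hypothetical and are not taken from the Hadoop code or from bugfix.patch. It simply compares a reducer that retries every cached output on the dead tasktracker with one that gives up on the host after the first failed transfer.

```python
def shuffle_pause(outputs_on_dead_host, backoff_seconds, skip_host_after_failure):
    """Total seconds one reducer spends backing off (hypothetical model).

    Assumes every fetch from the dead tasktracker fails and that each
    failure costs a fixed backoff; real backoff behaviour may differ.
    """
    total_backoff = 0.0
    host_marked_dead = False
    for _ in range(outputs_on_dead_host):
        if host_marked_dead:
            # Remaining outputs from this host are not attempted at all.
            continue
        # The transfer fails because the tasktracker is gone.
        total_backoff += backoff_seconds
        if skip_host_after_failure:
            host_marked_dead = True
    return total_backoff


if __name__ == "__main__":
    # 50 cached map outputs on the lost tasktracker, 30 s backoff per failure
    # (both numbers are made up for illustration).
    print(shuffle_pause(50, 30.0, skip_host_after_failure=False))
    print(shuffle_pause(50, 30.0, skip_host_after_failure=True))
```

With these made-up numbers, pulling one file at a time costs 50 backoffs, while dropping the host after the first failure costs one. This ignores re-execution of the lost maps, which is needed in either case.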
