[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated MAPREDUCE-5891:
----------------------------------
    Attachment: MAPREDUCE-5891-demo.patch

Thanks [~jlowe] for sharing your thoughts. I have a demo patch that works on the fetcher side. It is not complete yet because it lacks tests, but it would be great if you could review it and give some feedback on the approach I am taking.
In this demo patch:
- Add retry logic in openConnection (previously, we only had retry logic in the actual connect).
- Add retry logic in copyMapOutput. If an IOException is thrown within the tolerated NM downtime, the retry logic rebuilds the connection and skips the map tasks that were already shuffled in a previous iteration.
- Refactor some code within copyFromHost() to reuse as much code as possible, for conciseness.

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an
> NM restart it would be nice if reducers only reported a fetch failure after
> trying for a specified period of time to retrieve the data.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
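
For illustration only, the retry approach described in the comment above (retry within a tolerated downtime window, rebuild the connection, skip map outputs already shuffled) could be sketched roughly as follows. This is not code from MAPREDUCE-5891-demo.patch. The class ShuffleRetrySketch, the interfaces Connector and MapOutputSource, and the parameters retryWindowMs and retryIntervalMs are all hypothetical names chosen for the sketch, and resetting the window after each successful fetch is an assumption rather than something the comment specifies.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

final class ShuffleRetrySketch {

    /** Hypothetical stand-in for an open connection to one NM's shuffle service. */
    interface MapOutputSource {
        void fetch(String mapId) throws IOException;
    }

    /** Hypothetical factory that builds (or rebuilds) the connection. */
    interface Connector {
        MapOutputSource open() throws IOException;
    }

    /**
     * Fetches every map output in mapIds, retrying on IOException until
     * failures have persisted for retryWindowMs. Outputs fetched before a
     * failure are remembered and skipped after the connection is rebuilt.
     */
    static Set<String> copyWithRetry(Connector connector, List<String> mapIds,
                                     long retryWindowMs, long retryIntervalMs)
            throws IOException, InterruptedException {
        Set<String> done = new LinkedHashSet<>();
        boolean failing = false;   // true while a failure window is open
        long windowStart = 0L;     // nanoTime of the first failure in the window
        while (true) {
            try {
                MapOutputSource source = connector.open(); // rebuilt on every attempt
                for (String mapId : mapIds) {
                    if (done.contains(mapId)) {
                        continue; // already shuffled in a previous iteration
                    }
                    source.fetch(mapId);
                    done.add(mapId);
                    failing = false; // assumption: progress closes the failure window
                }
                return done;
            } catch (IOException e) {
                long now = System.nanoTime();
                if (!failing) {
                    failing = true;
                    windowStart = now;
                }
                if (now - windowStart >= TimeUnit.MILLISECONDS.toNanos(retryWindowMs)) {
                    throw e; // tolerated downtime exceeded: surface the fetch failure
                }
                Thread.sleep(retryIntervalMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] opens = {0};
        // Simulated NM restart: the first connection fails while fetching map_2.
        Connector connector = () -> {
            final int attempt = ++opens[0];
            return mapId -> {
                if (attempt == 1 && mapId.equals("map_2")) {
                    throw new IOException("NM restarting");
                }
            };
        };
        Set<String> fetched = copyWithRetry(
                connector, Arrays.asList("map_1", "map_2", "map_3"), 5000L, 10L);
        System.out.println(fetched + " after " + opens[0] + " connections");
    }
}
```

In this sketch the reducer would only report a fetch failure once the exception propagates out of copyWithRetry, which matches the goal stated in the issue description of reporting failures only after trying for a specified period.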