[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5891:
----------------------------------

    Attachment: MAPREDUCE-5891-demo.patch

Thanks [~jlowe] for sharing your thoughts. I have a demo patch that works on 
the fetcher side. It is not complete yet because it lacks tests, but it would 
be great if you could review it and give some feedback on the approach I am 
taking.
In this demo patch:
- Add retry logic in openConnection (previously, we only had retry logic in 
the actual connect).
- Add retry logic in copyMapOutput. If an IOException is thrown within the 
tolerated NM downtime, the retry logic rebuilds the connection and skips map 
tasks that were already shuffled in a previous iteration.
- Refactor some code within copyFromHost() to reuse as much code as possible, 
for conciseness.
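
To make the retry idea above concrete, here is a minimal standalone sketch. 
It is not the code in the attached patch: ShuffleRetrySketch, MapOutputSource, 
copyWithRetry, retryWindowMs and retryIntervalMs are hypothetical names, and 
resetting the retry window after a successful fetch is an assumption made for 
this illustration only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: every name here is hypothetical, not taken from the patch.
final class ShuffleRetrySketch {

    /** Stand-in for fetching one map output from an NM's shuffle service. */
    interface MapOutputSource {
        void fetch(String mapId) throws IOException;
    }

    /**
     * Fetches each map output in pending. On IOException it waits and retries
     * until retryWindowMs has elapsed since the first failure of the current
     * run of failures. Returns the map ids that were still not fetched.
     */
    static List<String> copyWithRetry(MapOutputSource source, List<String> pending,
                                      long retryWindowMs, long retryIntervalMs) {
        Set<String> remaining = new LinkedHashSet<>(pending);
        Long firstFailureNanos = null; // start of the current run of failures
        while (!remaining.isEmpty()) {
            String mapId = remaining.iterator().next();
            try {
                source.fetch(mapId);
                remaining.remove(mapId);   // never fetched again on later retries
                firstFailureNanos = null;  // assumption: progress resets the window
            } catch (IOException e) {
                long now = System.nanoTime();
                if (firstFailureNanos == null) {
                    firstFailureNanos = now;
                }
                long elapsedMs = (now - firstFailureNanos) / 1_000_000L;
                if (elapsedMs >= retryWindowMs) {
                    break; // window exhausted: caller reports these as fetch failures
                }
                try {
                    Thread.sleep(retryIntervalMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
            }
        }
        return new ArrayList<>(remaining);
    }
}
```

In this sketch the caller would report a fetch failure only for the map ids 
that copyWithRetry returns, so a short NM outage that ends inside the window 
produces no failure reports at all.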

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
