[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-5891:
----------------------------------

    Attachment: MAPREDUCE-5891-v2.patch

Thanks [~jlowe] for the review and comments! The v2 patch addresses all of 
them.
bq. We are retrying one more time when we're past the retry timeout which could 
result in a significantly longer time to discover fetch failures that aren't NM 
restart-related. This is also inconsistent with how openConnectionWithRetry 
behaves.
Nice catch. I moved the timeout check inside copyMapOutput, which now decides 
whether to throw an exception so the fetch is retried (before the timeout) or 
to report the fetch as failed (at or after the timeout).
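
For illustration only, the retry-or-fail decision described above might look roughly like the sketch below. This is not code from the patch: the class FetchRetryPolicy, its methods, and the Decision enum are hypothetical names, and the current time is passed in as a parameter purely to keep the sketch easy to test.

```java
/** Hypothetical sketch of a "retry until timeout" decision; not the patch's code. */
final class FetchRetryPolicy {
    enum Decision { RETRY, FAIL }

    // Assumed retry window in milliseconds (a made-up parameter for this sketch).
    private final long retryTimeoutMs;
    // Time of the first failure in the current retry window, or -1 if none yet.
    private long retryStartMs = -1;

    FetchRetryPolicy(long retryTimeoutMs) {
        this.retryTimeoutMs = retryTimeoutMs;
    }

    /** Called when a fetch attempt fails; nowMs is the current time in ms. */
    Decision onFailure(long nowMs) {
        if (retryStartMs < 0) {
            // First failure: start the retry window here.
            retryStartMs = nowMs;
        }
        // Before the timeout: retry. At or after the timeout: give up, with
        // no extra attempt once the window has elapsed.
        return (nowMs - retryStartMs < retryTimeoutMs)
            ? Decision.RETRY
            : Decision.FAIL;
    }

    /** Called when a fetch succeeds, so that a later failure starts a new window. */
    void onSuccess() {
        retryStartMs = -1;
    }
}
```

Because the check runs at the point of failure, a failure seen at or after the timeout is reported right away rather than after one more attempt, which is the behavior the review comment asked for.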

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
