[
https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Junping Du updated MAPREDUCE-5891:
----------------------------------
Attachment: MAPREDUCE-5891-v2.patch
Thanks [~jlowe] for the review and comments! The v2 patch addresses all of
them.
bq. We are retrying one more time when we're past the retry timeout which could
result in a significantly longer time to discover fetch failures that aren't NM
restart-related. This is also inconsistent with how openConnectionWithRetry
behaves.
Nice catch. I moved the timeout check inside copyMapOutput, which now decides
whether to throw an exception to trigger a retry (before the timeout) or to
report the fetch as failed (at or after the timeout).
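For illustration only, here is a minimal sketch of that decision, not the code
in the patch. The class, enum, exception, and method names below are all
hypothetical, and the timestamps are passed in as parameters so the logic can
be checked without a real clock. The point it shows is that a failure exactly
at the timeout is already treated as final, so no extra retry happens once the
timeout has been reached.

```java
import java.io.IOException;

public class FetchRetrySketch {

    /** Hypothetical result of classifying one failed copy attempt. */
    public enum Outcome { RETRY, FAILED }

    /** Hypothetical exception used to ask the caller for another attempt. */
    public static class RetryFetchException extends IOException {
        public RetryFetchException(String message) {
            super(message);
        }
    }

    /**
     * Decide what a failed copy attempt means.
     * Strictly before the timeout: retry. At or after the timeout: failed.
     */
    public static Outcome classifyFailure(long retryStartMillis,
                                          long nowMillis,
                                          long retryTimeoutMillis) {
        long elapsed = nowMillis - retryStartMillis;
        return elapsed < retryTimeoutMillis ? Outcome.RETRY : Outcome.FAILED;
    }

    /**
     * Assumed shape of the check inside a copy routine: throw to request a
     * retry while still inside the window, otherwise return normally so the
     * caller can report the fetch failure.
     */
    public static void onCopyFailure(long retryStartMillis,
                                     long nowMillis,
                                     long retryTimeoutMillis)
            throws RetryFetchException {
        if (classifyFailure(retryStartMillis, nowMillis, retryTimeoutMillis)
                == Outcome.RETRY) {
            throw new RetryFetchException("copy failed before timeout, retrying");
        }
        // At or after the timeout: fall through, the failure is final.
    }

    public static void main(String[] args) {
        long timeout = 180_000L; // assumed value, for the demo only
        System.out.println(classifyFailure(0L, 30_000L, timeout));
        System.out.println(classifyFailure(0L, 180_000L, timeout));
    }
}
```

Doing the comparison at the point of failure, rather than after the caller has
already scheduled another attempt, is what avoids the one extra retry past the
timeout.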
> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
> Key: MAPREDUCE-5891
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 2.5.0
> Reporter: Jason Lowe
> Assignee: Junping Du
> Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch,
> MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an
> NM restart, it would be nice if reducers only reported a fetch failure after
> trying for a specified period of time to retrieve the data.
--
This message was sent by Atlassian JIRA
(v6.2#6252)