[ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120850#comment-14120850 ]

Junping Du commented on MAPREDUCE-5891:
---------------------------------------

Thanks for the comments, [~mingma] and [~jlowe]!
bq. In the case slowstart is set to some small value, the reducer will fetch 
some mapper output and wait for the rest. Is it possible Fetcher.retryStartTime 
is set to some old value due to early NM host A restart, and thus mark fetcher 
retry timed out when it later tries to handle NM host B restart?
Nice catch! Fixed per Jason's suggestion below.
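
To illustrate the scenario for anyone following along, here is a minimal sketch of the idea, not the code from the patch. RetryWindow, shouldRetry() and onSuccess() are hypothetical names; only retryStartTime comes from Fetcher. It assumes timestamps are positive milliseconds, since 0 is used to mean "no retry window open".

```java
// Hypothetical sketch; not the code from the MAPREDUCE-5891 patch.
class RetryWindow {
    private final long timeoutMs;
    // 0 means no retry window is currently open (assumes real timestamps are > 0).
    private long retryStartTime = 0;

    RetryWindow(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called on a connection failure; returns true while retrying is still allowed. */
    boolean shouldRetry(long nowMs) {
        if (retryStartTime == 0) {
            retryStartTime = nowMs; // first failure opens a new window
        }
        return nowMs - retryStartTime <= timeoutMs;
    }

    /** Called after a successful fetch; forgets the old window. */
    void onSuccess() {
        retryStartTime = 0;
    }
}
```

Without the reset in onSuccess(), a failure against host B long after host A's restart would be measured from host A's start time and time out immediately.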

bq. so I think this just adds to the log length without adding a lot of 
valuable information
Agreed. Removed the log.

bq. Nit: The following code should simply be retryStartTime = 0;
Fixed.

bq. setupConnectionsWithRetry is now inconsistent when it comes to calling 
abortConnect() when stopped is true.
Good point. Fixed.

Also, I agree with the above comments from [~jlowe] on YARN-914 and YARN-1593.

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, 
> MAPREDUCE-5891-v3.patch, MAPREDUCE-5891-v4.patch, MAPREDUCE-5891.patch
>
>
> To minimize the number of map fetch failures reported by reducers across an 
> NM restart, it would be nice if reducers only reported a fetch failure after 
> trying for a specified period of time to retrieve the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
