[jira] [Updated] (TEZ-1637) Improved shuffle error handling across NM restarts

Rajesh Balamohan (JIRA) Tue, 14 Oct 2014 21:09:06 -0700

     [ https://issues.apache.org/jira/browse/TEZ-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Rajesh Balamohan updated TEZ-1637:
----------------------------------
    Attachment: TEZ-1637.2.patch

>>>>
In the ScatterGather Fetcher, the use of putBackRemainingMapOutputs(host) 
seems to be inconsistent.
>>>>
Fixed this. 

>>>>>
setupConnectionsWithRetry: I think it should just be called setupConnection.
>>>>>
Renamed.

>>>
Any reason to create a new list? Can "remaining" just be used, as in the other call?
>>>
"remaining" is a LinkedHashSet, whereas setupConnection() and ShuffleUtils 
accept a List<InputAttemptIdentifier>. Hence the change.
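
To illustrate the conversion described above, here is a minimal sketch. It is 
not taken from the patch: the class and method names are hypothetical, and 
String stands in for InputAttemptIdentifier so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tez. It only shows how a set such as
// "remaining" can be handed to an API that expects a List.
final class RemainingToList {
    private RemainingToList() {}

    // Copies the set into a new list. A LinkedHashSet iterates in insertion
    // order, so the resulting list keeps that order.
    static <T> List<T> toList(Set<T> remaining) {
        return new ArrayList<>(remaining);
    }
}
```

The copy is a snapshot, so later changes to the set are not reflected in the 
list that was passed on.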

>>>
A custom exception may be a better option. 
>>>
Introduced FetcherReadTimeoutException to address this.
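
As a rough sketch of the idea (the real class in the patch may look 
different), a dedicated exception lets callers tell a read timeout apart from 
other I/O failures. The superclass, the constructor, and the ReadTimeoutDemo 
helper below are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

// Sketch only: the name comes from the comment above, but the superclass and
// constructor are assumptions, not the definition in the patch.
class FetcherReadTimeoutException extends IOException {
    private static final long serialVersionUID = 1L;

    FetcherReadTimeoutException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical helper showing where such an exception could be raised.
final class ReadTimeoutDemo {
    private ReadTimeoutDemo() {}

    // Reads one byte; a socket read timeout is rethrown as the dedicated
    // exception type, while any other IOException passes through unchanged.
    static int readOneByte(InputStream in) throws IOException {
        try {
            return in.read();
        } catch (SocketTimeoutException e) {
            throw new FetcherReadTimeoutException("read timed out", e);
        }
    }
}
```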

>>>
We're primarily retrying on read errors. When a NodeManager goes down, is the 
connection timeout what prevents the connection from failing immediately? I 
assume that's why we don't need retry logic in place there.
>>>
Yes
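
The behaviour discussed above could be sketched as follows. This is an 
assumption-laden illustration, not the code in the patch: ReadRetrySketch, 
Fetch, fetchWithRetry and the nested ReadTimeout class (a stand-in for 
FetcherReadTimeoutException) are all hypothetical names. Only read timeouts 
are retried; connection failures and other I/O errors propagate at once.

```java
import java.io.IOException;

// Hypothetical sketch, not part of Tez.
final class ReadRetrySketch {
    private ReadRetrySketch() {}

    // Stand-in for a dedicated read-timeout exception.
    static class ReadTimeout extends IOException {
        private static final long serialVersionUID = 1L;

        ReadTimeout(String message) {
            super(message);
        }
    }

    // Hypothetical fetch operation that may fail with an IOException.
    interface Fetch<T> {
        T run() throws IOException;
    }

    // Runs the fetch up to maxAttempts times. A ReadTimeout triggers another
    // attempt; any other IOException is not caught and reaches the caller
    // immediately. If every attempt times out, the last timeout is rethrown.
    static <T> T fetchWithRetry(Fetch<T> fetch, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ReadTimeout last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.run();
            } catch (ReadTimeout e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A real fetcher would likely also bound the total time spent retrying; that is 
left out here to keep the sketch short.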

> Improved shuffle error handling across NM restarts 
> ---------------------------------------------------
>
>                 Key: TEZ-1637
>                 URL: https://issues.apache.org/jira/browse/TEZ-1637
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1637.1.patch, TEZ-1637.2.patch, TEZ-1637.WIP.patch
>
>
> Similar to MAPREDUCE-5891: we need to make sure the Tez shuffle handler can 
> handle NM restarts correctly. This is required for rolling upgrades.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)