[
https://issues.apache.org/jira/browse/TEZ-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-1637:
----------------------------------
Attachment: TEZ-1637.2.patch
>>>>
In the ScatterGather Fetcher, the use of putBackRemainingMapOutputs(host)
seems to be inconsistent.
>>>>
Fixed this.
>>>>>
setupConnectionsWithRetry: I think it should just be called setupConnection.
>>>>>
Renamed.
>>>
Any reason to create a new list? Can "remaining" just be used, as in the other call?
>>>
"remaining" is a LinkedHashSet, whereas setupConnection() and ShuffleUtils
accept List<InputAttemptIdentifier>. Hence the change.
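
A minimal sketch of that conversion, not the code from the patch: String
stands in for InputAttemptIdentifier, and the setupConnection() shown here is
a hypothetical placeholder for any method that accepts only a List.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class RemainingToList {
    // Hypothetical stand-in for a method that accepts only a List.
    static int setupConnection(List<String> attempts) {
        return attempts.size();
    }

    // Copy the set into a new List. Iterating a LinkedHashSet follows
    // insertion order, so the copy keeps the same order.
    static List<String> toList(Set<String> remaining) {
        return new ArrayList<>(remaining);
    }

    public static void main(String[] args) {
        Set<String> remaining = new LinkedHashSet<>();
        remaining.add("attempt_1");
        remaining.add("attempt_2");
        List<String> asList = toList(remaining);
        System.out.println(asList + " -> " + setupConnection(asList));
    }
}
```

The copy is independent of the set, so later changes to "remaining" do not
show up in the list that was handed to the callee.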
>>>
A custom exception may be a better option.
>>>
Introduced FetcherReadTimeoutException to address this.
>>>
We're primarily retrying on read errors. When a NodeManager goes down, is the
connection timeout what prevents the connection from failing immediately?
I assume that's why we don't need retry logic in place there.
>>>
Yes
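
A rough sketch of the idea discussed above (retry only on read timeouts,
signalled through a dedicated exception), not the code from the patch. The
exception name comes from this thread, but its body, the Reader interface,
readOnce(), readWithRetry() and maxAttempts are all assumptions made for
illustration.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class ReadRetrySketch {
    // Assumed shape of the custom exception: marks a read timeout as retryable.
    static class FetcherReadTimeoutException extends IOException {
        FetcherReadTimeoutException(Throwable cause) {
            super(cause);
        }
    }

    // Hypothetical abstraction over "read the shuffle data from the connection".
    interface Reader {
        byte[] read() throws IOException;
    }

    // Translate a socket read timeout into the custom exception.
    static byte[] readOnce(Reader reader) throws IOException {
        try {
            return reader.read();
        } catch (SocketTimeoutException e) {
            throw new FetcherReadTimeoutException(e);
        }
    }

    // Retry only on read timeouts; any other IOException propagates at once.
    static byte[] readWithRetry(Reader reader, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        FetcherReadTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return readOnce(reader);
            } catch (FetcherReadTimeoutException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Fails with a timeout on the first call, succeeds on the second.
        byte[] data = readWithRetry(() -> {
            if (calls[0]++ == 0) {
                throw new SocketTimeoutException("simulated read timeout");
            }
            return new byte[] {42};
        }, 3);
        System.out.println("bytes read: " + data.length + ", calls: " + calls[0]);
    }
}
```

A dedicated exception type lets the caller tell a retryable read timeout apart
from other I/O failures without inspecting message strings.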
> Improved shuffle error handling across NM restarts
> ---------------------------------------------------
>
> Key: TEZ-1637
> URL: https://issues.apache.org/jira/browse/TEZ-1637
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1637.1.patch, TEZ-1637.2.patch, TEZ-1637.WIP.patch
>
>
> Similar to MAPREDUCE-5891: we need to make sure the Tez shuffle handler can
> handle NM restarts correctly. This is required for rolling upgrades.