[
https://issues.apache.org/jira/browse/TEZ-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171495#comment-14171495
]
Siddharth Seth commented on TEZ-1637:
-------------------------------------
Comments on the patch
- In the ScatterGather Fetcher, the use of putBackRemainingMapOutputs(host)
seems to be inconsistent. It's invoked within setupConnectionWithRetry if the
fetcher is stopped, and will then be re-invoked since setupConnectionWithRetry
returns false, without the remaining list being cleared in between. I'm not
sure if this behaviour is the same in the broadcast fetcher.
- setupConnectionsWithRetry - Is this doing anything related to retries?
If not, I think it should just be called setupConnection.
- {code}if (!setupConnectionsWithRetry(host, new
LinkedList<InputAttemptIdentifier>(remaining))) {{code} Is there any reason to
create a new list? Can remaining just be used directly, as in the other call?
- {code}} catch (IOException e) {
// Setup connection again if disconnected
cleanupCurrentConnection(true);{code}
A custom exception may be a better way to detect a disconnect. IOException is
also thrown in other cases, like an invalid MapOutput.TYPE, or within the fetch
itself. The same applies to the broadcast fetcher.
- We're primarily retrying on read errors. When a NodeManager goes down, is the
connection timeout what prevents the connection attempt from failing
immediately? I'm assuming that's why we don't need retry logic in place there.
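
To illustrate the custom exception suggestion above, here is a minimal,
self-contained sketch. All names in it (ConnectionLostException, FetchAttempt,
fetchWithReconnect) are hypothetical and are not taken from the patch or from
the Tez code base. It only shows how a dedicated IOException subclass lets the
caller reconnect on a dropped connection while still failing fast on other I/O
errors such as an invalid header.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FetcherRetrySketch {

  /** Hypothetical exception: the connection dropped mid-read, so a reconnect may help. */
  static class ConnectionLostException extends IOException {
    ConnectionLostException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Hypothetical stand-in for one connect-and-fetch attempt against a host. */
  interface FetchAttempt {
    void run() throws IOException;
  }

  /**
   * Runs the attempt, reconnecting only when the failure is a ConnectionLostException.
   * Any other IOException is treated as a real fetch failure and is not retried.
   */
  static String fetchWithReconnect(FetchAttempt attempt, int maxReconnects, List<String> log) {
    for (int i = 0; i <= maxReconnects; i++) {
      try {
        attempt.run();
        return "success";
      } catch (ConnectionLostException e) {
        // Subclass is caught first: only this case triggers another attempt.
        log.add("reconnect");
      } catch (IOException e) {
        // Invalid data, bad header, etc. Reconnecting would not help here.
        log.add("fail: " + e.getMessage());
        return "failed";
      }
    }
    return "gave-up";
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    int[] calls = {0};

    // Connection drops twice, then the fetch goes through.
    String first = fetchWithReconnect(() -> {
      if (calls[0]++ < 2) {
        throw new ConnectionLostException("connection reset", null);
      }
    }, 3, log);
    System.out.println(first + " " + log);

    // A non-connection error is reported immediately, without a reconnect.
    log.clear();
    String second = fetchWithReconnect(() -> {
      throw new IOException("invalid header");
    }, 3, log);
    System.out.println(second + " " + log);
  }
}
```

With a plain catch (IOException e), both cases in main would take the reconnect
path, which is the ambiguity raised in the comment above.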
> Improved shuffle error handling across NM restarts
> ---------------------------------------------------
>
> Key: TEZ-1637
> URL: https://issues.apache.org/jira/browse/TEZ-1637
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1637.1.patch, TEZ-1637.WIP.patch
>
>
> Similar to MAPREDUCE-5891: we need to make sure the Tez shufflehandler can
> handle NM restarts correctly. This is required for rolling upgrades.