[jira] [Commented] (TEZ-1637) Improved shuffle error handling across NM restarts

Siddharth Seth (JIRA) Tue, 14 Oct 2014 13:48:06 -0700

    [ https://issues.apache.org/jira/browse/TEZ-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171495#comment-14171495 ]


Siddharth Seth commented on TEZ-1637:
-------------------------------------

Comments on the patch:
- In the ScatterGather fetcher, the handling of 
putBackRemainingMapOutputs(host) seems inconsistent. It is invoked within 
setupConnectionWithRetry if the fetcher is stopped, and is then invoked again 
by the caller because setupConnectionWithRetry returns false, without the 
remaining list being cleared in between. I'm not sure whether the broadcast 
fetcher behaves the same way.

- setupConnectionsWithRetry: is this doing anything related to retries? 
If not, I think it should just be called setupConnection.

- {code}
if (!setupConnectionsWithRetry(host, new LinkedList<InputAttemptIdentifier>(remaining))) {
{code}
Is there any reason to create a new list? Can remaining just be passed 
directly, as in the other call?

- {code}} catch (IOException e) {
          // Setup connection again if disconnected
          cleanupCurrentConnection(true);{code}
A custom exception may be a better option here. IOException is also thrown in 
other cases, such as an invalid MapOutput.TYPE or a failure within the fetch 
itself, so catching it cannot distinguish a disconnect from those. The same 
applies to the broadcast fetcher.

- We're primarily retrying on read errors. When a NodeManager goes down, is 
the connection timeout what prevents the connection attempt from failing 
immediately? I'm assuming that's why no retry logic is needed there.
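
To illustrate the first comment above: a minimal sketch, not taken from the 
patch, of a put-back that clears the local list so that a second invocation is 
harmless. The class PutBackSketch and its fields pending and remaining are 
hypothetical stand-ins; the real putBackRemainingMapOutputs(host) and the 
scheduler it talks to may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

public class PutBackSketch {
    // Hypothetical stand-in for the scheduler's queue of inputs awaiting a fetcher.
    public final List<String> pending = new ArrayList<>();
    // Inputs this fetcher was assigned for the current host and has not fetched yet.
    public final List<String> remaining = new ArrayList<>();

    // Hands unfetched inputs back and clears the local list,
    // so calling this a second time does not hand them back twice.
    public void putBackRemainingMapOutputs() {
        pending.addAll(remaining);
        remaining.clear();
    }

    public static void main(String[] args) {
        PutBackSketch fetcher = new PutBackSketch();
        fetcher.remaining.add("attempt_1");
        fetcher.remaining.add("attempt_2");

        // First call: e.g. inside connection setup when the fetcher is stopped.
        fetcher.putBackRemainingMapOutputs();
        // Second call: e.g. in the caller after connection setup returned false.
        fetcher.putBackRemainingMapOutputs();

        System.out.println(fetcher.pending); // each input appears exactly once
    }
}
```

An alternative under the same assumptions would be to put back in only one of 
the two places, which avoids relying on the method being safe to repeat.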
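
To illustrate the custom exception comment above: a rough sketch of how a 
dedicated exception type could separate "connection lost, reconnect" from 
other I/O failures. ConnectionLostException, FetchAttempt and fetchOnce are 
hypothetical names made up for this example and are not part of Tez or of the 
patch.

```java
import java.io.IOException;

public class DisconnectSketch {
    // Hypothetical exception type: signals that the connection was dropped.
    public static class ConnectionLostException extends IOException {
        public ConnectionLostException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for a single fetch attempt.
    public interface FetchAttempt {
        void run() throws IOException;
    }

    // Returns which action the caller should take after one attempt.
    public static String fetchOnce(FetchAttempt attempt) {
        try {
            attempt.run();
            return "ok";
        } catch (ConnectionLostException e) {
            // Only a dropped connection leads to a reconnect.
            return "reconnect";
        } catch (IOException e) {
            // Anything else (bad output type, fetch failure) is reported as a failure.
            return "fail";
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchOnce(() -> { throw new ConnectionLostException("NM restarted"); }));
        System.out.println(fetchOnce(() -> { throw new IOException("invalid output type"); }));
    }
}
```

The more specific catch clause has to come first, since ConnectionLostException 
is a subclass of IOException in this sketch.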

> Improved shuffle error handling across NM restarts 
> ---------------------------------------------------
>
>                 Key: TEZ-1637
>                 URL: https://issues.apache.org/jira/browse/TEZ-1637
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1637.1.patch, TEZ-1637.WIP.patch
>
>
> Similar to MAPREDUCE-5891: we need to make sure the Tez shuffle handler can 
> handle NM restarts correctly. This is required for rolling upgrades.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
