[ 
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955541#comment-14955541
 ] 

Siddharth Seth commented on TEZ-2882:
-------------------------------------

In general, the failure handling logic for Shuffle does need looking into, and 
improvement. That requires a fair amount of work though, and there's plenty of 
possibilities. The AM has information about fetch failures originating from a 
host, caused by a host etc - and can make quite a few decisions on which task 
to kill. Currently, it only kills the producer.
Meanwhile, this is a good patch to get in.

Comments on the patch itself.
- Configs - annotate as private, unstable. Also fraction instead of percentage 
in the name
- int threshold = (int) Math.max(1, numUniqueHosts * hostFailurePercentage); - 
should the right argument be a ceil(,).
- Same line - The threshold calculation should probably be different for 
smaller clusters. 1 node should never be enough. Min limit of 3? . Can (-1) be 
treated as a special value for the threshold configuration parameter, to 
completely disable this check ?
- LOG.info("numUniqueHosts=" + numUniqueHosts ... - Requires more context. Does 
the input name need to be logged, or is that part of the thread name. Logging 
which srcAttempt triggered this log line would be useful as well.
- Similarly for the log line in isFetcherHealthy
- if(pathToIdentifierMap.size() == numInputs) { <- Does this break for 
pipelined shuffle ?
- {code}+        if (failedShufflesSinceLastCompletion >=
+            remainingMaps.get() * minFailurePerHost) {code}
I believe this condition is meant to measure pendingHosts instead of 
remainingMaps.get() (considering parallel instances of a host are never 
scheduled). However, remainingMaps works better because of the way the Fetcher 
reports failures. If this is correct - could you please add a comment along 
with the condition. Fetcher reporting failures needs to be looked at along with 
any modifications to this code.
- Same condition - do we need to special case only a single input remaining ?
- healthySinceLastUpdate - This will almost always be true, given the previous 
condition. 
- Should failedAcrossNodes be outside of the condition "failedShuffle > 
remaining * allowed)
- Config param for abortFailureLimit ? (-1 defaults to code, anything else is 
an absolute value). Reducing it to 15 only helps for smaller jobs. numInputs/10 
will take over otherwise.
- Nit: "Preconditions.checkArgument(uniqueHosts.size() > 0, "No values in 
unique hosts");" - can be after numUniqueHosts
- Nit: There's a TODO which needs to be removed

> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to