[
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955541#comment-14955541
]
Siddharth Seth commented on TEZ-2882:
-------------------------------------
In general, the failure handling logic for Shuffle does need looking into, and
improvement. That requires a fair amount of work though, and there's plenty of
possibilities. The AM has information about fetch failures originating from a
host, caused by a host etc - and can make quite a few decisions on which task
to kill. Currently, it only kills the producer.
Meanwhile, this is a good patch to get in.
Comments on the patch itself.
- Configs - annotate as private, unstable. Also fraction instead of percentage
in the name
- int threshold = (int) Math.max(1, numUniqueHosts * hostFailurePercentage); -
should the right argument be a ceil(,).
- Same line - The threshold calculation should probably be different for
smaller clusters. 1 node should never be enough. Min limit of 3? . Can (-1) be
treated as a special value for the threshold configuration parameter, to
completely disable this check ?
- LOG.info("numUniqueHosts=" + numUniqueHosts ... - Requires more context. Does
the input name need to be logged, or is that part of the thread name. Logging
which srcAttempt triggered this log line would be useful as well.
- Similarly for the log line in isFetcherHealthy
- if(pathToIdentifierMap.size() == numInputs) { <- Does this break for
pipelined shuffle ?
- {code}+ if (failedShufflesSinceLastCompletion >=
+ remainingMaps.get() * minFailurePerHost) {code}
I believe this condition is meant to measure pendingHosts instead of
remainingMaps.get() (considering parallel instances of a host are never
scheduled). However, remainingMaps works better because of the way the Fetcher
reports failures. If this is correct - could you please add a comment along
with the condition. Fetcher reporting failures needs to be looked at along with
any modifications to this code.
- Same condition - do we need to special case only a single input remaining ?
- healthySinceLastUpdate - This will almost always be true, given the previous
condition.
- Should failedAcrossNodes be outside of the condition "failedShuffle >
remaining * allowed)
- Config param for abortFailureLimit ? (-1 defaults to code, anything else is
an absolute value). Reducing it to 15 only helps for smaller jobs. numInputs/10
will take over otherwise.
- Nit: "Preconditions.checkArgument(uniqueHosts.size() > 0, "No values in
unique hosts");" - can be after numUniqueHosts
- Nit: There's a TODO which needs to be removed
> Consider improving fetch failure handling
> -----------------------------------------
>
> Key: TEZ-2882
> URL: https://issues.apache.org/jira/browse/TEZ-2882
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)