[ https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959868#comment-14959868 ]
Siddharth Seth commented on TEZ-2882:
-------------------------------------
bq. 1) numInputs = 1 & errors happen. If so, it would be handled by
hasFailedAcrossNodes() and fetching would be declared unhealthy. Along with
this, stall duration is included to kill the consumer.
This is the one I'm concerned about, and I think it is a candidate for special
casing. On clusters of up to 5 nodes, failing to fetch from a single node will
cause the consumer to fail, even though the fault could just as easily lie with
the source instead.
{code}
+ this.minFailurePerHost = Math.max(0, conf.getInt(
+     TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST,
+     TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST_DEFAULT));
+ this.hostFailureFraction = conf.getFloat(TezRuntimeConfiguration
{code}
Curious: why Math.max(0, ...)? Could we read the value directly and error-check
it instead?
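To illustrate what I mean by reading the value directly and error-checking it, here is a minimal sketch. The class and method names are hypothetical and not part of the patch; the idea is only that a negative setting is rejected loudly rather than silently clamped to 0.

```java
// Hypothetical helper for illustration; not part of the TEZ-2882 patch.
final class MinFailureConfig {

  /**
   * Returns the configured value unchanged when it is valid,
   * and throws when it is negative instead of clamping it to 0.
   */
  static int validateMinFailuresPerHost(int configured) {
    if (configured < 0) {
      throw new IllegalArgumentException(
          "min failures per host must be >= 0, got " + configured);
    }
    return configured;
  }
}
```

In the constructor this would wrap the existing conf.getInt(key, default) call in place of Math.max(0, ...), so a misconfigured value surfaces at startup.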
{code}
if (hostFailureFraction != -1) {
{code}
This is a float equality comparison. A < 0 check, or a check against a small
range around -1, would be better.
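A rough sketch of the two alternatives, assuming -1 is the "not configured" sentinel for hostFailureFraction (my reading of the check above, not something I have confirmed). The class name, method names, and tolerance are hypothetical.

```java
// Hypothetical helper for illustration; not part of the TEZ-2882 patch.
final class HostFailureFractionCheck {

  /** Option 1: treat any negative value (and NaN) as "disabled". */
  static boolean isEnabled(float hostFailureFraction) {
    return hostFailureFraction >= 0f;
  }

  /** Option 2: treat values within a small tolerance of -1 as the sentinel. */
  static boolean isSentinel(float hostFailureFraction, float tolerance) {
    return Math.abs(hostFailureFraction + 1f) < tolerance;
  }
}
```

Option 1 is simpler and also covers other invalid negative settings, which is why I would lean towards it.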
{code}
+ if (failedShufflesSinceLastCompletion >=
+ remainingMaps.get() * minFailurePerHost) {
{code}
Looking at this some more: do we need some mechanism to disable this?
minFailurePerHost is also used by the nodeFractionFailure heuristics, so it
can't be changed independently.
{code}
+ fetcherHealthy =
+ (((float) failedShufflesSinceLastCompletion / (
+ failedShufflesSinceLastCompletion + remainingMaps.get()))
+ < maxStallTimeFraction);
{code}
Should this be < maxStallTimeFraction or < maxAllowedFailedFetchFraction?
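For reference, the ratio being compared is failed / (failed + remaining), which is a fraction of fetches rather than a fraction of time. A small standalone sketch with made-up numbers; the class name, the threshold parameter name, and the guard for an empty total are my additions, not the patch's.

```java
// Hypothetical helper for illustration; not part of the TEZ-2882 patch.
final class FetchFailureFraction {

  /** Fraction of failed fetches out of failed plus remaining. */
  static float failedFraction(int failed, int remaining) {
    int total = failed + remaining;
    if (total == 0) {
      return 0f; // added guard: avoids 0/0 producing NaN
    }
    return (float) failed / total;
  }

  /** Healthy while the failed fraction stays below the allowed threshold. */
  static boolean isHealthy(int failed, int remaining, float maxAllowedFraction) {
    return failedFraction(failed, remaining) < maxAllowedFraction;
  }
}
```

With 3 failures and 7 remaining maps the fraction is 0.3, so a threshold of 0.5 reports healthy; with 6 failures and 4 remaining it is 0.6 and the same threshold reports unhealthy.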
Rest looks good to me.
> Consider improving fetch failure handling
> -----------------------------------------
>
> Key: TEZ-2882
> URL: https://issues.apache.org/jira/browse/TEZ-2882
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch,
> TEZ-2882.4.patch
>
>