[ 
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959868#comment-14959868
 ] 

Siddharth Seth commented on TEZ-2882:
-------------------------------------

bq. 1) numInputs = 1 & errors happen. If so, it would be handled by 
hasFailedAcrossNodes() and fetching would be declared unhealthy. Along with 
this, stall duration is included to kill the consumer.
This is the case I'm concerned about, and I think it is a candidate for special 
casing. On clusters of up to 5 nodes, failing to fetch from a single node will 
cause the consumer to fail, even though the fault could just as easily lie with 
the source.

{code}
+    this.minFailurePerHost = Math.max(0, conf.getInt(
+        TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST,
+        TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST_DEFAULT));
+    this.hostFailureFraction = conf.getFloat(TezRuntimeConfiguration
{code}
Curious: why Math.max(0, ...)? Why not read the value directly and error-check it?
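
As a rough illustration of the "read it and error-check it" alternative, here is a minimal sketch. The class and method names (MinFailureConfigCheck, requireNonNegative) are hypothetical and not part of Tez; the point is only that a negative setting is rejected loudly instead of being silently clamped to 0.

```java
// Hypothetical helper, not part of Tez: rejects a negative configured value
// instead of silently clamping it with Math.max(0, ...).
final class MinFailureConfigCheck {
  private MinFailureConfigCheck() {}

  /** Returns the value unchanged, or throws if it is negative. */
  static int requireNonNegative(String key, int value) {
    if (value < 0) {
      throw new IllegalArgumentException(
          key + " must be >= 0, but was " + value);
    }
    return value;
  }
}
```

Under this sketch, the value returned by conf.getInt(...) would be passed through requireNonNegative before being assigned to minFailurePerHost, so a misconfiguration surfaces at startup rather than changing behavior quietly.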

{code}
if (hostFailureFraction != -1) {
{code}
This is an equality comparison on a float. A < 0 check, or a range check around -1, would be better.
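
A minimal sketch of the sign-based check, assuming that any negative hostFailureFraction is meant to disable the heuristic (the patch only shows -1 as the sentinel, so that reading is an assumption). HostFailureFractionCheck and isEnabled are hypothetical names.

```java
// Hypothetical helper, not part of Tez. Assumption: every negative value
// (not just exactly -1) means "heuristic disabled".
final class HostFailureFractionCheck {
  private HostFailureFractionCheck() {}

  /** True when the configured fraction should be applied. */
  static boolean isEnabled(float hostFailureFraction) {
    // A sign check avoids depending on an exact float match against -1.
    return hostFailureFraction >= 0f;
  }
}
```

With an exact "!= -1" test, a configured value such as -0.5 would be treated as enabled; the sign check treats it as disabled.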

{code}
+        if (failedShufflesSinceLastCompletion >=
+            remainingMaps.get() * minFailurePerHost) {
{code}
Looking at this some more: do we need some mechanism to disable this check? 
minFailurePerHost is also used by the nodeFractionFailure heuristics, so it 
can't be changed independently.

{code}
+          fetcherHealthy =
+              (((float) failedShufflesSinceLastCompletion / (
+                  failedShufflesSinceLastCompletion + remainingMaps.get()))
+                  < maxStallTimeFraction);
{code}
Should this be < maxStallTimeFraction, or < maxAllowedFailedFetchFraction?
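
To make the quantity being compared concrete, here is a small sketch of the ratio from the snippet above, checked against a threshold named maxAllowedFailedFetchFraction. Which threshold the patch should actually use is the open question, so the parameter name is an assumption, as are the class name FetchHealthSketch and the guard for an empty denominator.

```java
// Hypothetical sketch, not the Tez implementation.
final class FetchHealthSketch {
  private FetchHealthSketch() {}

  /** failed / (failed + remaining), as in the reviewed snippet. */
  static float failedFetchFraction(int failedShuffles, int remainingMaps) {
    int total = failedShuffles + remainingMaps;
    if (total == 0) {
      return 0f; // Assumption: nothing failed and nothing pending counts as 0.
    }
    return (float) failedShuffles / total;
  }

  /** Healthy while the failed fraction stays below the allowed threshold. */
  static boolean isFetcherHealthy(int failedShuffles, int remainingMaps,
      float maxAllowedFailedFetchFraction) {
    return failedFetchFraction(failedShuffles, remainingMaps)
        < maxAllowedFailedFetchFraction;
  }
}
```

For example, with 3 failed shuffles and 7 remaining maps the fraction is 3 / 10, which is below a threshold of 0.5; with 6 failed and 4 remaining it is 6 / 10, which is not.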

Rest looks good to me.

> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch, 
> TEZ-2882.4.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
