abstractdog commented on a change in pull request #152:
URL: https://github.com/apache/tez/pull/152#discussion_r753052967
##########
File path: tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
##########
@@ -300,6 +300,24 @@ public TezConfiguration(boolean loadDefaults) {
TEZ_AM_PREFIX + "max.allowed.time-sec.for-read-error";
public static final int TEZ_AM_MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC_DEFAULT = 300;
+ /**
+ * Double value. Assuming that a certain number of downstream hosts reported fetch failure for a
+ * given upstream host, this config drives the max allowed ratio of (downstream hosts) / (all hosts).
+ * The total number of used hosts are tracked by AMNodeTracker, which divides the distinct number of
+ * downstream hosts blaming source(upstream) tasks in a given vertex. If the fraction is beyond this
+ * limit, the upstream task attempt is marked as failed (so blamed for the fetch failure).
+ * E.g. if this set to 0.2, in case of 3 different hosts reporting fetch failure
+ * for the same upstream host in a cluster which currently utilizes 10 nodes, the upstream task
+ * is immediately blamed for the fetch failure.
+ *
+ * Expert level setting.
+ */
+ @ConfigurationScope(Scope.AM)
+ @ConfigurationProperty(type="integer")
+ public static final String TEZ_AM_MAX_ALLOWED_DOWNSTREAM_HOST_FAILURES_FRACTION =
+ TEZ_AM_PREFIX + "max.allowed.downstream.host.failures.fraction";
+ public static final double TEZ_AM_MAX_ALLOWED_DOWNSTREAM_HOST_FAILURES_FRACTION_DEFAULT = 0.2;
Review comment:
I see. I assume that on that small cluster a fraction of 0.25 might work
properly: with 4 hosts, a single downstream host reporting a failure
(1/4 = 0.25, which is not beyond the limit) won't make the source restart
immediately; at least 2 reporting downstream hosts are needed.
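
To make the arithmetic concrete, here is a minimal sketch of the check as I read it from the javadoc. It is not the code in this PR: the class name `FetchFailureFractionSketch` and the method `exceedsLimit` are hypothetical, and the strict `>` comparison is an assumption based on the wording "beyond this limit".

```java
// Hypothetical illustration only; not part of Tez.
final class FetchFailureFractionSketch {

  /**
   * Returns true if the share of distinct downstream hosts reporting a fetch
   * failure is strictly greater than the configured fraction.
   * Assumption: "beyond this limit" means a strict comparison.
   */
  static boolean exceedsLimit(int reportingHosts, int totalHosts, double maxFraction) {
    if (totalHosts <= 0) {
      return false; // no hosts tracked, nothing to compare against
    }
    double fraction = reportingHosts / (double) totalHosts;
    return fraction > maxFraction;
  }

  public static void main(String[] args) {
    // Small cluster from this comment: 4 hosts, fraction set to 0.25.
    System.out.println(exceedsLimit(1, 4, 0.25));
    System.out.println(exceedsLimit(2, 4, 0.25));
    // Javadoc example: 3 of 10 hosts reporting, fraction set to 0.2.
    System.out.println(exceedsLimit(3, 10, 0.2));
  }
}
```

Under that assumption, one reporting host out of 4 sits exactly on the 0.25 limit and does not trigger the blame, while two reporting hosts do. If the real comparison is `>=`, a single host would already be enough at 0.25.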