[
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110290#comment-17110290
]
Panagiotis Garefalakis edited comment on TEZ-4183 at 5/18/20, 3:46 PM:
-----------------------------------------------------------------------
Hey [~jeagles] thanks for the extra details, found them pretty useful.
I created a patch for the unordered Fetcher that can now keep track of diskRead
errors (similar to the ordered one) and makes use of both time- and
threshold-based batching.
In more detail, that Fetcher now informs the AM:
* Immediately if maxTimeToWaitForReportMillis is 0 (similar to
reportReadErrorImmediately in the ordered Fetcher)
* When the elapsed time exceeds SHUFFLE_BATCH_WAIT ms (batched events)
* When more than THRESHOLD readErrors have occurred for a particular
task_attempt, where THRESHOLD is maxFetchFailuresBeforeReporting, 5 by default
(batched events)
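
To make the three rules above concrete, here is a minimal sketch of the decision logic. It is not the actual patch: the class FetchFailureBatcher and its methods are hypothetical, and only the names maxTimeToWaitForReportMillis and maxFetchFailuresBeforeReporting are taken from this comment. I am also assuming that a single wait value plays the role of both maxTimeToWaitForReportMillis and SHUFFLE_BATCH_WAIT.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not Tez code): decides when read errors go to the AM. */
class FetchFailureBatcher {
  // 0 means "report every error immediately" (assumed semantics).
  private final long maxTimeToWaitForReportMillis;
  // Per-task-attempt error count that forces a report before the timer fires.
  private final int maxFetchFailuresBeforeReporting;
  // Errors recorded since the last report, keyed by task attempt id.
  private final Map<String, Integer> pendingErrors = new HashMap<>();
  private long lastReportMillis;

  FetchFailureBatcher(long maxWaitMillis, int threshold, long nowMillis) {
    this.maxTimeToWaitForReportMillis = maxWaitMillis;
    this.maxFetchFailuresBeforeReporting = threshold;
    this.lastReportMillis = nowMillis;
  }

  /** Records one read error; returns true if the pending batch should be sent now. */
  boolean onReadError(String taskAttemptId, long nowMillis) {
    int count = pendingErrors.merge(taskAttemptId, 1, Integer::sum);
    if (maxTimeToWaitForReportMillis == 0) {
      return true; // rule 1: immediate reporting
    }
    if (nowMillis - lastReportMillis > maxTimeToWaitForReportMillis) {
      return true; // rule 2: batch wait time exceeded
    }
    return count > maxFetchFailuresBeforeReporting; // rule 3: threshold exceeded
  }

  /** Returns the pending error counts and starts a new batch window. */
  Map<String, Integer> drain(long nowMillis) {
    Map<String, Integer> batch = new HashMap<>(pendingErrors);
    pendingErrors.clear();
    lastReportMillis = nowMillis;
    return batch;
  }
}
```

A caller would invoke onReadError for each failure and, when it returns true, send one event built from drain() to the AM. Passing the clock value in explicitly keeps the sketch easy to test.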
Thoughts here? cc: [~abstractdog] [~prasanth_j]
> Time- and threshold-batched FetchFailure event propagation to AM
> ----------------------------------------------------------------
>
> Key: TEZ-4183
> URL: https://issues.apache.org/jira/browse/TEZ-4183
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Panagiotis Garefalakis
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Fetcher currently sends failure events to AM as soon as they are discovered:
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930
> To reduce AM pressure we can: 1) batch fetch-failure events and send them
> periodically (every BATCH_WAIT), and 2) if the number of disk errors exceeds
> a threshold, send the message to the AM immediately (instead of waiting).