[ https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967481#comment-14967481 ]
Rajesh Balamohan edited comment on TEZ-2882 at 10/21/15 5:09 PM:
-----------------------------------------------------------------
Thanks [~sseth]
This is the one I'm concerned about - and think is a candidate for special
casing.
- For 1 input, "hasFailedAcrossNodes()" would take at least 16 failures before
returning true. This should be good for small clusters? It is basically a
combination of TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION and
the threshold governed by TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST.
"if (hostFailureFraction != -1) " - Float comparison
- Fixed.
"failedShufflesSinceLastCompletion" - Looking at this some more - do we need
some mechanism to disable this?
- Fixed. Added "TEZ_RUNTIME_SHUFFLE_FAILED_CHECK_SINCE_LAST_COMPLETION" to
disable this, along with a test.
"fetcherHealthy"
- Fixed it to compute with maxAllowedFailedFetchFraction.
Will commit it once Jenkins passes.
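
For illustration only (this is not taken from the patch): the float comparison
point above concerns a check of the form "hostFailureFraction != -1". A minimal
sketch of two common ways to avoid an exact float inequality is shown here. The
class name, the UNSET sentinel, the EPSILON tolerance, and both helper methods
are assumptions made for this example; the attached patch remains the reference
for how the check was actually fixed.

```java
public class FailureFractionCheck {
    // Hypothetical sentinel meaning "no fraction configured" (assumption for this sketch).
    static final float UNSET = -1f;
    // Hypothetical tolerance used when comparing against the sentinel.
    static final float EPSILON = 1e-6f;

    // Option 1: compare against the sentinel with a tolerance instead of using "!=".
    static boolean isConfigured(float hostFailureFraction) {
        return Math.abs(hostFailureFraction - UNSET) > EPSILON;
    }

    // Option 2: treat every negative value as "unset", which needs no equality test at all.
    static boolean isConfiguredBySign(float hostFailureFraction) {
        return hostFailureFraction >= 0f;
    }

    public static void main(String[] args) {
        System.out.println(isConfigured(UNSET));       // sentinel value
        System.out.println(isConfigured(0.2f));        // a real fraction
        System.out.println(isConfiguredBySign(UNSET)); // sentinel value
        System.out.println(isConfiguredBySign(0.2f));  // a real fraction
    }
}
```

Option 2 is usually simpler when valid fractions are known to lie in [0, 1],
since any negative value can then only mean "not configured".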
> Consider improving fetch failure handling
> -----------------------------------------
>
> Key: TEZ-2882
> URL: https://issues.apache.org/jira/browse/TEZ-2882
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch,
> TEZ-2882.4.patch, TEZ-2882.5.patch
>
>