[ 
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967481#comment-14967481
 ] 

Rajesh Balamohan edited comment on TEZ-2882 at 10/21/15 5:09 PM:
-----------------------------------------------------------------

Thanks [~sseth]

This is the one I'm concerned about - and think is a candidate for special 
casing.
- For 1 input, "hasFailedAcrossNodes()" would take at least 16 failures before 
returning true. This should be good for small clusters? It is basically a 
combination of TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION and the 
threshold governed by TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST.

"if (hostFailureFraction != -1) " - Float comparison
- Fixed.
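
For context on the float comparison point, here is a minimal sketch of one 
common way to avoid testing a float against an exact sentinel. It is not 
necessarily what the patch does: the class name, the UNSET sentinel, and both 
helper methods are hypothetical and exist only for illustration.

```java
public class HostFailureFractionCheck {
    // Hypothetical sentinel meaning "fraction has not been computed".
    static final float UNSET = -1f;

    // A valid fraction is never negative, so a range check replaces the
    // exact "!= -1" comparison. NaN also fails this check.
    static boolean isSet(float hostFailureFraction) {
        return hostFailureFraction >= 0f;
    }

    // True only when a computed fraction is above the acceptable limit.
    static boolean exceeds(float hostFailureFraction, float acceptableFraction) {
        return isSet(hostFailureFraction)
            && hostFailureFraction > acceptableFraction;
    }

    public static void main(String[] args) {
        System.out.println(exceeds(UNSET, 0.2f)); // false: nothing computed
        System.out.println(exceeds(0.5f, 0.2f));  // true: above the limit
        System.out.println(exceeds(0.1f, 0.2f));  // false: within the limit
    }
}
```

A range check keeps the "not computed" case explicit without relying on exact 
float equality.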

"failedShufflesSinceLastCompletion" - Looking at this some more - do we need 
some mechanism to disable this?
- Fixed. Added "TEZ_RUNTIME_SHUFFLE_FAILED_CHECK_SINCE_LAST_COMPLETION" to 
disable this, along with a test.

"fetcherHealthy"
- Fixed it to compute with maxAllowedFailedFetchFraction.

Will commit it once Jenkins passes.


> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch, 
> TEZ-2882.4.patch, TEZ-2882.5.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
