[ 
https://issues.apache.org/jira/browse/TEZ-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791456#comment-14791456
 ] 

Bikas Saha commented on TEZ-814:
--------------------------------

The heuristics are mainly designed to prevent an inadvertent flurry of re-runs 
due to intermittent network issues. So we have the fraction and 
unique-failures-reported heuristics to verify that multiple readers are 
reporting the same failure.

Regardless of these current and future heuristics, we need to ensure that jobs 
do not hang indefinitely due to non-convergent heuristics. So this patch adds a 
time-based deadline. If a consumer attempt keeps reporting a read error for a 
timespan exceeding a threshold (default 300s), then the producer attempt will 
be re-run.
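
To illustrate how the fraction check and the time-based deadline could combine, 
here is a minimal Java sketch. It is not the code in the attached patch: the 
class name OutputFailureTracker, its method reportReadError, and the 
constructor parameters are hypothetical names chosen for this example. The 
25% fraction and 300s deadline are the values mentioned in this issue.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not the actual Tez implementation.
 * Decides whether a producer attempt should be re-run based on
 * read errors reported by its consumers.
 */
final class OutputFailureTracker {
    private final int numConsumers;
    private final double failureFraction;   // assumed, e.g. 0.25
    private final long maxErrorSpanMillis;  // assumed, e.g. 300_000 (300s)

    // First time (in millis) each consumer reported a read error.
    private final Map<Integer, Long> firstErrorMillis = new HashMap<>();

    OutputFailureTracker(int numConsumers, double failureFraction,
                         long maxErrorSpanMillis) {
        this.numConsumers = numConsumers;
        this.failureFraction = failureFraction;
        this.maxErrorSpanMillis = maxErrorSpanMillis;
    }

    /**
     * Records a read error from one consumer.
     * Returns true if the producer attempt should be re-run.
     */
    boolean reportReadError(int consumerId, long nowMillis) {
        // Remember only the first report per consumer.
        firstErrorMillis.putIfAbsent(consumerId, nowMillis);

        // Fraction heuristic: enough unique consumers report a failure.
        boolean enoughReporters =
            firstErrorMillis.size() >= failureFraction * numConsumers;

        // Deadline: this consumer has been failing for longer than the limit,
        // so re-run even if the fraction heuristic never converges.
        long errorSpan = nowMillis - firstErrorMillis.get(consumerId);
        boolean deadlineExceeded = errorSpan > maxErrorSpanMillis;

        return enoughReporters || deadlineExceeded;
    }
}
```

With 8 consumers, a single consumer that keeps failing never reaches the 25% 
fraction on its own, but it triggers a re-run once its errors span more than 
300s. That is the case where the job would otherwise hang.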

[~rajesh.balamohan] [~hitesh] Please review

> Improve heuristic for determining a task has failed outputs
> -----------------------------------------------------------
>
>                 Key: TEZ-814
>                 URL: https://issues.apache.org/jira/browse/TEZ-814
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>         Attachments: TEZ-814.1.patch
>
>
> Currently 25% of consumers need to report failure. However, we may not always 
> have that many error reports, e.g. when this is the last consumer and its 
> source is lost, or when some consumers are cut off from the source. The job 
> may hang on those consumers waiting for a re-run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
