[
https://issues.apache.org/jira/browse/TEZ-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804876#comment-14804876
]
Rajesh Balamohan commented on TEZ-814:
--------------------------------------
LGTM, +1. Even when neither tez.task.max.allowed.output.failures nor
tez.task.max.allowed.output.failures.fraction is reached, this would
end up restarting the producer after 300 seconds in case of an output read
error. Should this be backported to 0.6 and 0.5 as well?
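
The combined check described in this comment can be sketched roughly as
follows. This is only an illustration, not the code from the attached patches:
the class name, method name, and parameters are hypothetical, the count
threshold of 10 in the usage note is an arbitrary example value, and the way
the two thresholds and the 300-second limit are combined is an assumption
based on the comment and the issue description.

```java
// Hypothetical sketch; NOT the actual Tez implementation.
final class OutputFailureHeuristic {
    // Assumed to mirror tez.task.max.allowed.output.failures.
    private final int maxAllowedFailures;
    // Assumed to mirror tez.task.max.allowed.output.failures.fraction.
    private final double maxAllowedFraction;
    // Time limit after the first read-error report (300 s in the comment).
    private final long maxWaitMillis;

    OutputFailureHeuristic(int maxAllowedFailures,
                           double maxAllowedFraction,
                           long maxWaitMillis) {
        this.maxAllowedFailures = maxAllowedFailures;
        this.maxAllowedFraction = maxAllowedFraction;
        this.maxWaitMillis = maxWaitMillis;
    }

    /**
     * Decides whether the producer should be re-run.
     *
     * @param uniqueFailedConsumers distinct consumers that reported a read error
     * @param totalConsumers        total consumers of this producer's output
     * @param firstReportMillis     time of the first read-error report
     * @param nowMillis             current time
     */
    boolean shouldRestartProducer(int uniqueFailedConsumers,
                                  int totalConsumers,
                                  long firstReportMillis,
                                  long nowMillis) {
        if (uniqueFailedConsumers <= 0) {
            return false; // no error reports, nothing to act on
        }
        if (uniqueFailedConsumers >= maxAllowedFailures) {
            return true; // absolute count threshold reached
        }
        if (totalConsumers > 0
                && (double) uniqueFailedConsumers / totalConsumers >= maxAllowedFraction) {
            return true; // fraction threshold reached
        }
        // Fallback: neither threshold was reached, but a report has been
        // outstanding for too long (e.g. the last remaining consumer).
        return nowMillis - firstReportMillis >= maxWaitMillis;
    }
}
```

With new OutputFailureHeuristic(10, 0.25, 300_000L), a single report out of
100 consumers does not trigger a re-run after one second, but does once
300 seconds have passed since that report.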
> Improve heuristic for determining a task has failed outputs
> -----------------------------------------------------------
>
> Key: TEZ-814
> URL: https://issues.apache.org/jira/browse/TEZ-814
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Fix For: 0.7.1
>
> Attachments: TEZ-814.1.patch, TEZ-814.2.patch
>
>
> Currently 25% of consumers need to report failure. However, we may not always
> have that many error reports, e.g. when this is the last consumer and the
> source is lost, or when some consumers are cut off from the source. The job
> may hang on those consumers waiting for a re-run.