[
https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084062#comment-17084062
]
László Bodor commented on TEZ-4139:
-----------------------------------
Thanks [~rajesh.balamohan]. I uploaded [^TEZ-4139.02.WIP.patch], which is still
a WIP patch, and added support for the destination localhostname in the event.
Reflecting on what you've said:
"consider failures from same node as single failure" <-- to me, this means that
the numerator should be the distinct count of destination hosts that reported a
fetch failure. Am I right? If so, I'll update the computation to use
"getDistinctDestinationHostsCountWithInputReadError" instead of
"getTotalUniqueReportsCount". (I still used getTotalUniqueReportsCount in 02.WIP
to validate that it works according to the original assumption, i.e.
TestTaskAttempt passes.)
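To make the two candidate numerators concrete, here is a minimal sketch of how
they could differ. It is not the code from the patch: the class name
FetchFailureReports, the addReport method, and the fraction helper are
hypothetical, and dividing by the number of consumer tasks is an assumption
about the existing computation. Only the two getter names come from the
comment above.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tracker for illustration only; not the actual Tez class. */
class FetchFailureReports {
    // Unique downstream task attempts that reported an input read error.
    private final Set<String> reportingAttempts = new HashSet<>();
    // Distinct destination hosts on which those attempts ran.
    private final Set<String> reportingHosts = new HashSet<>();

    /** Records one fetch-failure report (hypothetical method). */
    void addReport(String attemptId, String destinationHost) {
        reportingAttempts.add(attemptId);
        reportingHosts.add(destinationHost);
    }

    /** Numerator used in 02.WIP: one count per reporting attempt. */
    int getTotalUniqueReportsCount() {
        return reportingAttempts.size();
    }

    /** Proposed numerator: all failures from the same node count once. */
    int getDistinctDestinationHostsCountWithInputReadError() {
        return reportingHosts.size();
    }

    /** Assumed shape of the fraction: numerator over consumer task count. */
    static float fraction(int numerator, int consumerTaskCount) {
        return (float) numerator / consumerTaskCount;
    }
}
```

With three reporting attempts, two of them on the same host, and ten consumer
tasks, the attempt-based count is 3 and the host-based count is 2, so under the
assumed denominator the host-based fraction is the smaller of the two and the
source attempt would be failed less eagerly.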
> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
> Key: TEZ-4139
> URL: https://issues.apache.org/jira/browse/TEZ-4139
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: László Bodor
> Priority: Major
> Attachments: TEZ-4139.01.WIP.patch, TEZ-4139.02.WIP.patch
>
>
> When many downstream attempts fail to pull data from a source task, the
> source task is marked as failed and retried. Currently, the failure
> fraction is computed from the unique downstream task attempts that reported
> a failure. However, it should also take node information into account when
> computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849
--
This message was sent by Atlassian Jira
(v8.3.4#803005)