[ 
https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405759#comment-17405759
 ] 

László Bodor commented on TEZ-4139:
-----------------------------------

[~rajesh.balamohan]: I'm about to refresh this patch again, and I hope I can 
create a patch soon.
I checked the conversation: it seems we're about to consider downstream hosts, 
but I would like to consider upstream hosts too, because I recently faced 
shuffle issues where lots of read errors happened due to a single node failure, 
and even if the mapper task is marked as OUTPUT_LOST, task attempts fail 
because of the bumped-up failure fraction.

I would like to handle both upstream and downstream hosts; please let me know 
if that doesn't make sense.
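
To illustrate the difference, here is a minimal sketch, not the actual 
TaskAttemptImpl code: the class, the ReadErrorReport type and both method names 
are hypothetical, and it only assumes that the fraction is "unique reporters 
divided by total possible reporters". It compares counting unique downstream 
attempts with counting unique downstream hosts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Tez.
class FailureFractionSketch {

    // One read-error report: which downstream attempt failed to fetch, and on which host it ran.
    static final class ReadErrorReport {
        final String downstreamAttemptId;
        final String downstreamHost;

        ReadErrorReport(String downstreamAttemptId, String downstreamHost) {
            this.downstreamAttemptId = downstreamAttemptId;
            this.downstreamHost = downstreamHost;
        }
    }

    // Fraction based on unique downstream task attempts that reported a read error.
    static float fractionByAttempts(List<ReadErrorReport> reports, int totalConsumerAttempts) {
        Set<String> attempts = new HashSet<>();
        for (ReadErrorReport r : reports) {
            attempts.add(r.downstreamAttemptId);
        }
        return (float) attempts.size() / totalConsumerAttempts;
    }

    // Fraction based on unique downstream hosts that reported a read error.
    static float fractionByHosts(List<ReadErrorReport> reports, int totalConsumerHosts) {
        Set<String> hosts = new HashSet<>();
        for (ReadErrorReport r : reports) {
            hosts.add(r.downstreamHost);
        }
        return (float) hosts.size() / totalConsumerHosts;
    }

    public static void main(String[] args) {
        // Three of four consumer attempts report errors, all running on the same host.
        List<ReadErrorReport> reports = Arrays.asList(
            new ReadErrorReport("attempt_1", "host-a"),
            new ReadErrorReport("attempt_2", "host-a"),
            new ReadErrorReport("attempt_3", "host-a"));
        System.out.println("by attempts: " + fractionByAttempts(reports, 4));
        System.out.println("by hosts:    " + fractionByHosts(reports, 4));
    }
}
```

In this made-up case the attempt-based fraction is 0.75 while the host-based 
one is 0.25, so a single bad downstream node would weigh much less. The same 
grouping could in principle be applied to upstream hosts, but that part is not 
sketched here.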

> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
>                 Key: TEZ-4139
>                 URL: https://issues.apache.org/jira/browse/TEZ-4139
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4139.01.WIP.patch, TEZ-4139.02.WIP.patch
>
>
> When lots of downstream attempts fail to pull the information from a source 
> task, the source task is marked as failed and it is retried. Currently, the 
> failure fraction is computed by looking at unique task attempts from 
> downstream. However, it should take node information into account when 
> computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
