[jira] [Comment Edited] (TEZ-4139) Tez should consider node information for computing failure fraction

Jira Sat, 28 Aug 2021 03:11:07 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405759#comment-17405759
 ]


László Bodor edited comment on TEZ-4139 at 8/28/21, 10:10 AM:
--------------------------------------------------------------

[~rajesh.balamohan]: I'm about to refresh this patch again and hope to post a 
new version soon.
I checked the conversation: it seems we're about to consider downstream hosts, 
but I would like to consider upstream hosts too, because I have recently faced 
shuffle issues where lots of read errors happen due to a single node failure 
(reducer tasks on different hosts cannot fetch from a single map host), and 
even if the mapper task is marked as OUTPUT_LOST, it's too late, as reducer 
attempts have already failed.

I would like to handle both upstream and downstream hosts; please let me know 
if that doesn't make sense. What I'm trying to achieve now is:

1. Regarding downstream hosts (the original intention of this patch): consider 
failures from the same downstream node as a single failure => postpone mapper 
task blaming (OUTPUT_LOST) if the read error is likely caused by the 
downstream node.

2. Regarding upstream hosts: collect all reported upstream mapper task attempts 
for a vertex, and if their number goes beyond a certain threshold for the same 
source (map) host, blame the mapper task immediately => blame the mapper task 
attempt as soon as possible if the read error is likely caused by an upstream 
node failure (somewhat similar to TEZ-3910).

Can these changes go into the same patch?
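
To make point 2 concrete, here is a rough sketch of per-host bookkeeping for 
reported upstream attempts. It is only an illustration under assumptions: the 
class name, the method name and the threshold parameter are hypothetical and 
are not taken from the Tez code base or from the attached patches, and 
"beyond a certain amount" is interpreted as a strict greater-than comparison.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper (not a Tez class): tracks, per source host, the distinct
 * upstream task attempts that downstream consumers reported read errors for.
 */
class UpstreamHostFailureTracker {
    // Assumed tunable: how many distinct reported attempts a host may accumulate.
    private final int maxReportedAttemptsPerHost;
    private final Map<String, Set<String>> reportedAttemptsByHost = new HashMap<>();

    UpstreamHostFailureTracker(int maxReportedAttemptsPerHost) {
        this.maxReportedAttemptsPerHost = maxReportedAttemptsPerHost;
    }

    /**
     * Records a read error against a source attempt that ran on sourceHost.
     * Returns true once the number of distinct reported attempts on that host
     * exceeds the threshold, i.e. when the host itself looks like the culprit.
     */
    boolean reportReadError(String sourceHost, String sourceAttemptId) {
        Set<String> attempts =
            reportedAttemptsByHost.computeIfAbsent(sourceHost, h -> new HashSet<>());
        attempts.add(sourceAttemptId); // repeated reports for one attempt count once
        return attempts.size() > maxReportedAttemptsPerHost;
    }
}
```

With a threshold of 2, the third distinct mapper attempt reported on the same 
host would return true, which is where the mapper attempt could be blamed 
immediately instead of waiting for the failure fraction to build up.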


was (Author: abstractdog):
[~rajesh.balamohan]: I'm about to refresh this patch again and hope to post a 
new version soon.
I checked the conversation: it seems we're about to consider downstream hosts, 
but I would like to consider upstream hosts too, because I have recently faced 
shuffle issues where lots of read errors happen due to a single node failure 
(reducer tasks on different hosts cannot fetch from a single map host), and 
even if the mapper task is marked as OUTPUT_LOST, task attempts fail because 
of the bumped-up failure fraction.

I would like to handle both upstream and downstream hosts; please let me know 
if that doesn't make sense.

> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
>                 Key: TEZ-4139
>                 URL: https://issues.apache.org/jira/browse/TEZ-4139
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4139.01.WIP.patch, TEZ-4139.02.WIP.patch
>
>
> When lots of downstream attempts fail to pull the information from a source 
> task, the source task is marked as failed and retried. Currently, the 
> failure fraction is computed from the unique downstream task attempts. 
> However, it should also take node information into account when 
> computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849
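
The difference between the two ways of counting can be sketched as below. This 
is a simplified illustration and not the code in TaskAttemptImpl: the Report 
type, both method names and the choice of denominators (number of running 
consumers versus number of nodes hosting them) are assumptions made for the 
example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of attempt-based versus node-based failure fractions. */
class FailureFractionSketch {

    /** One fetch-failure report from a downstream attempt (hypothetical type). */
    static final class Report {
        final String attemptId;
        final String nodeId;

        Report(String attemptId, String nodeId) {
            this.attemptId = attemptId;
            this.nodeId = nodeId;
        }
    }

    /** Fraction based on unique downstream attempts that reported a failure. */
    static float fractionByAttempts(Iterable<Report> reports, int runningConsumers) {
        Set<String> attempts = new HashSet<>();
        for (Report r : reports) {
            attempts.add(r.attemptId);
        }
        return runningConsumers == 0 ? 0f : (float) attempts.size() / runningConsumers;
    }

    /** Fraction based on unique downstream nodes; a bad node counts only once. */
    static float fractionByNodes(Iterable<Report> reports, int consumerNodes) {
        Set<String> nodes = new HashSet<>();
        for (Report r : reports) {
            nodes.add(r.nodeId);
        }
        return consumerNodes == 0 ? 0f : (float) nodes.size() / consumerNodes;
    }
}
```

For example, if three attempts on the same node report failures while four 
consumers run on four nodes, the attempt-based fraction is 0.75 but the 
node-based fraction is only 0.25, so a single misbehaving downstream node is 
less likely to get the source task blamed.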



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
