[
https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405759#comment-17405759
]
László Bodor edited comment on TEZ-4139 at 8/28/21, 10:10 AM:
--------------------------------------------------------------
[~rajesh.balamohan]: I'm about to refresh this patch again; I hope I can create a
new patch soon.
I checked the conversation; it seems we're about to consider downstream hosts,
but I would like to consider upstream hosts too, because I have recently faced
shuffle issues where lots of read errors happen due to a single node failure
(reducer tasks from different hosts cannot fetch from a single map host), and
even if the mapper task is marked as OUTPUT_LOST, it's too late because reducer
attempts have already failed.
I would like to handle both upstream and downstream hosts; please let me know
if this doesn't make sense. What I'm trying to achieve now is:
1. regarding downstream hosts (the original intention of this patch): consider
failures reported from the same downstream node as a single failure => postpone
blaming the mapper task (OUTPUT_LOST) if the read error is likely caused by the
downstream node
2. regarding upstream hosts: collect all reported upstream mapper task attempts
for a vertex, and if their count goes beyond a certain threshold for the same
source (map) host, blame the mapper task immediately => blame the mapper task
attempt as soon as possible if the read error is likely caused by an upstream
node failure (somewhat similar to TEZ-3910)
Can these changes go into the same patch?
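
The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch itself: the class FetchFailureTracker, its method onReadError, the Decision enum and both thresholds are hypothetical names that do not exist in Tez, and using the number of downstream hosts as the denominator is an assumption as well.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in Tez.
final class FetchFailureTracker {
  enum Decision { BLAME_SOURCE_NOW, POSTPONE }

  private final int numDownstreamHosts;         // assumed denominator for rule 1
  private final double maxFailureFraction;      // assumed threshold for rule 1
  private final int maxReportedAttemptsPerHost; // assumed threshold for rule 2

  // Rule 1 state: distinct downstream hosts that reported each source attempt.
  private final Map<String, Set<String>> reportingHostsBySourceAttempt = new HashMap<>();
  // Rule 2 state: distinct source attempts reported for each source (map) host.
  private final Map<String, Set<String>> reportedAttemptsBySourceHost = new HashMap<>();

  FetchFailureTracker(int numDownstreamHosts, double maxFailureFraction,
      int maxReportedAttemptsPerHost) {
    this.numDownstreamHosts = numDownstreamHosts;
    this.maxFailureFraction = maxFailureFraction;
    this.maxReportedAttemptsPerHost = maxReportedAttemptsPerHost;
  }

  /** Records one read error and decides whether to blame the source attempt. */
  Decision onReadError(String sourceAttemptId, String sourceHost, String downstreamHost) {
    Set<String> attemptsOnHost =
        reportedAttemptsBySourceHost.computeIfAbsent(sourceHost, h -> new HashSet<>());
    attemptsOnHost.add(sourceAttemptId);

    Set<String> reportingHosts =
        reportingHostsBySourceAttempt.computeIfAbsent(sourceAttemptId, a -> new HashSet<>());
    reportingHosts.add(downstreamHost);

    // Rule 2: too many distinct attempts reported on one source host
    // => the upstream node is the likely culprit, blame immediately.
    if (attemptsOnHost.size() > maxReportedAttemptsPerHost) {
      return Decision.BLAME_SOURCE_NOW;
    }

    // Rule 1: reports from the same downstream host count only once, so a
    // single bad downstream node cannot push the fraction over the limit.
    double failureFraction = (double) reportingHosts.size() / numDownstreamHosts;
    return failureFraction > maxFailureFraction
        ? Decision.BLAME_SOURCE_NOW
        : Decision.POSTPONE;
  }
}
```

In this sketch, repeated reports from reducers on one downstream host keep returning POSTPONE, while reports against several different mapper attempts on the same map host flip to BLAME_SOURCE_NOW once the per-host threshold is exceeded.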
was (Author: abstractdog):
[~rajesh.balamohan]: I'm about to refresh this patch again; I hope I can create a
new patch soon.
I checked the conversation; it seems we're about to consider downstream hosts,
but I would like to consider upstream hosts too, because I have recently faced
shuffle issues where lots of read errors happen due to a single node failure
(reducer tasks from different hosts cannot fetch from a single map host), and
even if the mapper task is marked as OUTPUT_LOST, task attempts fail because of
the bumped-up failure fraction.
I would like to handle both upstream and downstream hosts; please let me know
if this doesn't make sense.
> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
> Key: TEZ-4139
> URL: https://issues.apache.org/jira/browse/TEZ-4139
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: László Bodor
> Priority: Major
> Attachments: TEZ-4139.01.WIP.patch, TEZ-4139.02.WIP.patch
>
>
> When lots of downstream attempts fail to pull the information from a source
> task, the source task is marked as failed and retried. Currently, the failure
> fraction is computed by looking at unique task attempts from downstream.
> However, it should also take node information into account when
> computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849
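
To illustrate the difference the description points at, here is a small sketch that computes the fraction both ways from the same set of read-error reports. It is not the code at the link above: FailureFractionSketch, Report, byAttempts and byNodes are hypothetical names, and the denominators (number of downstream tasks, number of downstream nodes) are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names exist in Tez.
final class FailureFractionSketch {
  /** One read-error report: the downstream attempt that sent it and its node. */
  static final class Report {
    final String attemptId;
    final String nodeId;

    Report(String attemptId, String nodeId) {
      this.attemptId = attemptId;
      this.nodeId = nodeId;
    }
  }

  /** Fraction based on unique reporting downstream task attempts. */
  static double byAttempts(List<Report> reports, int numDownstreamTasks) {
    Set<String> attempts = new HashSet<>();
    for (Report r : reports) {
      attempts.add(r.attemptId);
    }
    return (double) attempts.size() / numDownstreamTasks;
  }

  /** Fraction based on unique reporting downstream nodes. */
  static double byNodes(List<Report> reports, int numDownstreamNodes) {
    Set<String> nodes = new HashSet<>();
    for (Report r : reports) {
      nodes.add(r.nodeId);
    }
    return (double) nodes.size() / numDownstreamNodes;
  }
}
```

With three reports from three different attempts that all ran on the same node, 10 downstream tasks and 5 downstream nodes, byAttempts gives 3/10 while byNodes gives 1/5, so a single misbehaving downstream node weighs less in the node-based fraction.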