[ https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405759#comment-17405759 ]
László Bodor edited comment on TEZ-4139 at 8/27/21, 11:43 AM:
--------------------------------------------------------------
[~rajesh.balamohan]: I'm about to refresh this patch again, and I hope to have
a new patch soon.
I checked the conversation. It seems we're about to consider downstream hosts,
but I would like to consider upstream hosts too, because I recently faced
shuffle issues where lots of read errors happened due to a single node failure
(reducer tasks from different hosts could not fetch from a single map host),
and even if the mapper task is marked as OUTPUT_LOST, task attempts fail
because of the bumped-up failure fraction.
I would like to handle both upstream and downstream hosts. Please let me know
if that doesn't make sense.
was (Author: abstractdog):
[~rajesh.balamohan]: I'm about to refresh this patch again, and I hope to have
a new patch soon.
I checked the conversation. It seems we're about to consider downstream hosts,
but I would like to consider upstream hosts too, because I recently faced
shuffle issues where lots of read errors happened due to a single node failure,
and even if the mapper task is marked as OUTPUT_LOST, task attempts fail
because of the bumped-up failure fraction.
I would like to handle both upstream and downstream hosts. Please let me know
if that doesn't make sense.
> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
> Key: TEZ-4139
> URL: https://issues.apache.org/jira/browse/TEZ-4139
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: László Bodor
> Priority: Major
> Attachments: TEZ-4139.01.WIP.patch, TEZ-4139.02.WIP.patch
>
>
> When lots of downstream attempts fail to pull data from a source task, the
> source task is marked as failed and is retried. Currently, the failure
> fraction is computed from the unique downstream task attempts that reported
> a failure. However, the computation of "failureFraction" should also take
> node information into account.
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849
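
To illustrate the difference the description is pointing at, here is a minimal sketch that compares a failure fraction based on unique downstream attempts with one based on unique downstream hosts. It is not the code at the linked TaskAttemptImpl lines: the class FailureFractionSketch, the FetchFailure type, and both method names are hypothetical and exist only for this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Tez.
public class FailureFractionSketch {

  /** One fetch failure reported by a downstream attempt (assumed shape). */
  static final class FetchFailure {
    final String downstreamAttemptId;
    final String downstreamHost;

    FetchFailure(String downstreamAttemptId, String downstreamHost) {
      this.downstreamAttemptId = downstreamAttemptId;
      this.downstreamHost = downstreamHost;
    }
  }

  /** Fraction of downstream attempts that reported a failure (attempt-based view). */
  static float attemptFraction(List<FetchFailure> failures, int totalDownstreamAttempts) {
    if (totalDownstreamAttempts <= 0) {
      return 0f;
    }
    Set<String> attempts = new HashSet<>();
    for (FetchFailure f : failures) {
      attempts.add(f.downstreamAttemptId);
    }
    return (float) attempts.size() / totalDownstreamAttempts;
  }

  /** Fraction of downstream hosts that reported a failure (node-based view). */
  static float hostFraction(List<FetchFailure> failures, int totalDownstreamHosts) {
    if (totalDownstreamHosts <= 0) {
      return 0f;
    }
    Set<String> hosts = new HashSet<>();
    for (FetchFailure f : failures) {
      hosts.add(f.downstreamHost);
    }
    return (float) hosts.size() / totalDownstreamHosts;
  }

  public static void main(String[] args) {
    // Three failing attempts, all running on the same downstream host.
    List<FetchFailure> failures = List.of(
        new FetchFailure("attempt_1", "hostA"),
        new FetchFailure("attempt_2", "hostA"),
        new FetchFailure("attempt_3", "hostA"));

    System.out.println(attemptFraction(failures, 4)); // prints 0.75
    System.out.println(hostFraction(failures, 4));    // prints 0.25
  }
}
```

In this made-up case, three of four downstream attempts report a failure, but they all run on one of four hosts, so the attempt-based fraction is high while the node-based fraction stays low. That would hint at a problem on the downstream node rather than in the source task's output. The same grouping could be applied to source hosts to cover the upstream case raised in the comment.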