[jira] [Commented] (TEZ-3910) Single node can cause Tez job to fail during shuffle

Jonathan Turner Eagles (Jira) Wed, 23 Jul 2025 12:33:04 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18009372#comment-18009372
 ]


Jonathan Turner Eagles commented on TEZ-3910:
---------------------------------------------

One of the reasons this is held up is the transition from FATAL attempt 
failures to NON_FATAL attempt failures. This can leave the job open to 
indefinite hangs if there are unhandled cases.

In addition, the author of the patch is no longer active. Are you interested 
in picking up the work, [~sudeni]?

> Single node can cause Tez job to fail during shuffle
> ----------------------------------------------------
>
>                 Key: TEZ-3910
>                 URL: https://issues.apache.org/jira/browse/TEZ-3910
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3910.001.patch, TEZ-3910.002.patch, 
> TEZ-3910.003.patch, TEZ-3910.004.patch, TEZ-3910.005.patch
>
>
> There is a race where a downstream task that is running into fetch failures 
> due to bad output from the upstream task can continue to blame itself for the 
> failure before the AM can re-run the offending upstream task and fix the 
> fetch failure. This causes the DAG to fail even if only a single node fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
