Kuhu Shukla commented on TEZ-3910:

Attaching an initial patch that removes the reducer's logic to fail 
independently. The patch is not fully ready yet: it does not address the one 
condition where the number of failed downstream tasks exceeds the newly defined 
threshold and the reducer is not healthy. IMHO this differs between the ordered 
and the unordered case, since unordered will retry while ordered would want 
the downstream to fail and the upstream to run again? I would appreciate any 
initial design comments and suggestions on the right approach here. I have 
added an "is reducer healthy" flag to the InputReadError event.
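
To make the open question concrete, here is a rough sketch of one possible 
reading of the decision, not the code in the patch. All names in it 
(FetchFailureSketch, ReadError, Action, maxFailedFraction) are hypothetical and 
are not taken from the Tez code base. It assumes the threshold is a fraction of 
downstream tasks, and the ordered/unordered branch is only the guess described 
above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names come from the Tez code base.
class FetchFailureSketch {
    enum Action { WAIT, RERUN_UPSTREAM, RETRY_FETCH, FAIL_DOWNSTREAM_AND_RERUN_UPSTREAM }

    // Stand-in for an InputReadError event carrying the new health flag.
    static final class ReadError {
        final String downstreamTaskId;
        final boolean reducerHealthy;

        ReadError(String downstreamTaskId, boolean reducerHealthy) {
            this.downstreamTaskId = downstreamTaskId;
            this.reducerHealthy = reducerHealthy;
        }
    }

    private final int downstreamTaskCount;
    private final double maxFailedFraction; // assumed form of the "new threshold"
    private final boolean orderedInput;
    // Distinct downstream tasks that reported a read error for this upstream output.
    private final Set<String> reporters = new HashSet<>();

    FetchFailureSketch(int downstreamTaskCount, double maxFailedFraction, boolean orderedInput) {
        this.downstreamTaskCount = downstreamTaskCount;
        this.maxFailedFraction = maxFailedFraction;
        this.orderedInput = orderedInput;
    }

    // Decision taken on the AM side; the reducer no longer fails on its own.
    Action onReadError(ReadError error) {
        reporters.add(error.downstreamTaskId);
        boolean overThreshold = reporters.size() > maxFailedFraction * downstreamTaskCount;
        if (!overThreshold) {
            return Action.WAIT; // not enough evidence against the upstream output yet
        }
        if (error.reducerHealthy) {
            return Action.RERUN_UPSTREAM; // healthy reducers agree the output is bad
        }
        // Open question from the comment: over threshold AND reducer unhealthy.
        return orderedInput ? Action.FAIL_DOWNSTREAM_AND_RERUN_UPSTREAM : Action.RETRY_FETCH;
    }
}
```

With 4 downstream tasks and a fraction of 0.5, the first two reports only 
wait; the third crosses the threshold and the health flag then picks the 
branch.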

> Single node can cause Tez job to fail during shuffle
> ----------------------------------------------------
>                 Key: TEZ-3910
>                 URL: https://issues.apache.org/jira/browse/TEZ-3910
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.1
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3910.001.patch
> There is a race where a downstream task that is running into fetch failures 
> due to bad output from the upstream task can continue to blame itself for the 
> failure before the AM can do a re-run of the upstream offending task and fix 
> the fetch failure. This causes the DAG to fail even if only a single node fails.

This message was sent by Atlassian JIRA
