[ 
https://issues.apache.org/jira/browse/TEZ-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved TEZ-4400.
-------------------------------
    Resolution: Duplicate

> Tez takes a long time to recover from shuffle data not found errors
> -------------------------------------------------------------------
>
>                 Key: TEZ-4400
>                 URL: https://issues.apache.org/jira/browse/TEZ-4400
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Eric Payne
>            Priority: Minor
>
> Recently, many nodes had their shuffle data wiped during a NodeManager (NM)
> upgrade, and many Tez jobs took far too long to recover. Recovery from this
> condition should be quick. The NM returns an error code indicating that the
> shuffle data was not found, and that alone is sufficient evidence that no
> amount of retrying is likely to fix the issue. As soon as the NM reports
> shuffle data as not found, the task should report the not-found error to the
> ApplicationMaster (AM), and the AM should treat even a single not-found error
> as sufficient cause to re-run the upstream task.
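
The behavior requested in the description could be sketched roughly as follows.
This is only an illustration of the proposed policy, not Tez code: the class
FetchFailurePolicy, its enums, and the maxTransientFailures threshold are all
hypothetical names introduced here, and the real AM-side handling in Tez is
not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the policy described above; none of these names
// are Tez APIs.
class FetchFailurePolicy {
    // Kinds of shuffle fetch failure a downstream task might report.
    enum FailureKind { CONNECTION_ERROR, DATA_NOT_FOUND }

    // What the AM decides to do in response.
    enum Action { RETRY_FETCH, RERUN_UPSTREAM }

    // Assumed threshold for failures that might be transient.
    private final int maxTransientFailures;
    // Count of possibly-transient failures per upstream task.
    private final Map<String, Integer> transientFailures = new HashMap<>();

    FetchFailurePolicy(int maxTransientFailures) {
        this.maxTransientFailures = maxTransientFailures;
    }

    Action onFetchFailure(String upstreamTaskId, FailureKind kind) {
        // A not-found error means the data is gone, so retrying cannot help:
        // a single report is enough to re-run the upstream task.
        if (kind == FailureKind.DATA_NOT_FOUND) {
            return Action.RERUN_UPSTREAM;
        }
        // Other errors may be transient, so tolerate a few before giving up.
        int count = transientFailures.merge(upstreamTaskId, 1, Integer::sum);
        return count >= maxTransientFailures
                ? Action.RERUN_UPSTREAM
                : Action.RETRY_FETCH;
    }
}
```

With this split, a not-found report short-circuits the retry counting
entirely, while connection-style errors still get the usual tolerance.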


