[ https://issues.apache.org/jira/browse/TEZ-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ayush Saxena resolved TEZ-4400.
-------------------------------
    Resolution: Duplicate

> Tez takes a long time to recover from shuffle data not found errors
> -------------------------------------------------------------------
>
>                 Key: TEZ-4400
>                 URL: https://issues.apache.org/jira/browse/TEZ-4400
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Eric Payne
>            Priority: Minor
>
> Recently a lot of nodes ended up having their shuffle data wiped during an NM
> upgrade. It took many of the TEZ jobs far too long to recover. This should be
> something that can be quickly recovered. The NM is returning an error code
> indicating the shuffle data was not found, and that alone is sufficient
> evidence to know that no amount of retries is likely to fix the issue. As
> soon as the NM reports shuffle data as not found, the task should report the
> not found error to the AM and the AM should treat even a single not found
> error as sufficient cause to re-run the upstream task.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)