[
https://issues.apache.org/jira/browse/TEZ-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514953#comment-17514953
]
Eric Payne commented on TEZ-4400:
---------------------------------
Cc: [~jeagles]
> Tez takes a long time to recover from shuffle data not found errors
> -------------------------------------------------------------------
>
> Key: TEZ-4400
> URL: https://issues.apache.org/jira/browse/TEZ-4400
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Eric Payne
> Priority: Minor
>
> Recently, many nodes had their shuffle data wiped during an NM upgrade, and
> many of the Tez jobs took far too long to recover. Recovery from this
> condition should be quick. The NM returns an error code indicating that the
> shuffle data was not found, and that alone is sufficient evidence that no
> amount of retries is likely to fix the issue. As soon as the NM reports
> shuffle data as not found, the task should report the not found error to
> the AM, and the AM should treat even a single not found error as sufficient
> cause to re-run the upstream task.
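
The behavior requested above could be sketched roughly as follows. This is only an illustration of the proposed policy, not the Tez implementation: FetchError, FetchFailurePolicy, and transientFailureLimit are hypothetical names invented for this sketch and do not correspond to existing Tez or YARN classes or settings.

```java
// Hypothetical sketch: none of these names are real Tez or YARN APIs.
enum FetchError {
    CONNECTION_FAILED, // possibly transient: the host may come back
    READ_TIMEOUT,      // possibly transient: a retry may succeed
    DATA_NOT_FOUND     // the NM reports that the shuffle data is gone
}

final class FetchFailurePolicy {
    // Assumed threshold for errors that may be transient.
    private final int transientFailureLimit;

    FetchFailurePolicy(int transientFailureLimit) {
        if (transientFailureLimit < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.transientFailureLimit = transientFailureLimit;
    }

    // Task side: retrying is pointless once the NM says the data is not found.
    boolean shouldRetryFetch(FetchError error) {
        return error != FetchError.DATA_NOT_FOUND;
    }

    // AM side: a single not found report is enough to re-run the upstream
    // task; other errors must reach the (assumed) threshold first.
    boolean shouldRerunUpstream(FetchError error, int reportedFailures) {
        if (error == FetchError.DATA_NOT_FOUND) {
            return reportedFailures >= 1;
        }
        return reportedFailures >= transientFailureLimit;
    }
}
```

The point of separating the two methods is that the task stops retrying immediately on a not found error, while the AM decision for other error types keeps whatever threshold is already in use.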