[
https://issues.apache.org/jira/browse/TEZ-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515185#comment-17515185
]
László Bodor commented on TEZ-4400:
-----------------------------------
Thanks for reporting this, [~epayne]!
Please take a look at TEZ-4233; I think it is aimed at exactly this problem.
Basically, if the ShuffleHandler (in the NM) fails to find the file, it
reports a specific message, so the fetcher can forward it to the AM.
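
To make the idea concrete, here is a rough sketch of the behaviour under
discussion. It is not the TEZ-4233 patch, and none of the names below are Tez
APIs: the class, the enum, and the retry budget of 3 are all hypothetical,
chosen only for illustration. The assumption is that the fetcher can tell a
"data not found" response apart from other fetch failures.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; these names are illustrative, not Tez classes. */
class ShuffleFailureSketch {

    /** How a fetch attempt failed (illustrative categories). */
    enum FailureKind { TRANSIENT, DATA_NOT_FOUND }

    /** Assumed retry budget for transient failures. */
    static final int MAX_TRANSIENT_FAILURES = 3;

    // Transient failure counts per upstream task attempt, as an AM might track them.
    private final Map<String, Integer> transientFailures = new HashMap<>();

    /** Fetcher side: retry only when the failure could plausibly go away. */
    static boolean shouldRetryFetch(FailureKind kind) {
        return kind == FailureKind.TRANSIENT;
    }

    /** AM side: returns true when the upstream attempt should be re-run. */
    boolean onFetchFailure(String upstreamAttemptId, FailureKind kind) {
        if (kind == FailureKind.DATA_NOT_FOUND) {
            // The shuffle data is gone, so one report is enough evidence.
            return true;
        }
        // Other failures only count towards the retry budget.
        int count = transientFailures.merge(upstreamAttemptId, 1, Integer::sum);
        return count >= MAX_TRANSIENT_FAILURES;
    }
}
```

With this shape, a single DATA_NOT_FOUND report triggers a re-run of the
upstream task right away, while transient failures still have to reach the
budget first.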
> Tez takes a long time to recover from shuffle data not found errors
> -------------------------------------------------------------------
>
> Key: TEZ-4400
> URL: https://issues.apache.org/jira/browse/TEZ-4400
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Eric Payne
> Priority: Minor
>
> Recently a lot of nodes ended up having their shuffle data wiped during an NM
> upgrade. Many of the Tez jobs took far too long to recover. This is a
> failure that should be quick to recover from. The NM is returning an error code
> indicating the shuffle data was not found, and that alone is sufficient
> evidence to know that no amount of retries is likely to fix the issue. As
> soon as the NM reports shuffle data as not found, the task should report the
> not found error to the AM and the AM should treat even a single not found
> error as sufficient cause to re-run the upstream task.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)