[jira] [Commented] (TEZ-4400) Tez takes a long time to recover from shuffle data not found errors

Jira Thu, 31 Mar 2022 02:15:07 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-4400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515185#comment-17515185
 ]


László Bodor commented on TEZ-4400:
-----------------------------------

Thanks for reporting this, [~epayne]!
Please take a look at TEZ-4233; I think it is meant to solve exactly this 
problem. Basically, if the ShuffleHandler (in the NM) fails to find the file, 
it reports a specific message, so the fetcher can forward it to the AM.
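
To illustrate the idea, here is a rough sketch. It is not the TEZ-4233 patch 
and does not use Tez's actual classes: the class names, the enum, the marker 
string, and the retry threshold are all hypothetical. It only shows a fetcher 
telling a "not found" response apart from other failures, and an AM-side 
policy that re-runs the upstream task on the first such report.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the Tez code base. */
public class ShuffleFailureSketch {

    /** How a fetcher might classify a failed fetch. */
    enum FailureKind { TRANSIENT, DATA_NOT_FOUND }

    // Assumed marker text. The actual message introduced by TEZ-4233 is not
    // quoted in this thread, so this string is a placeholder.
    static final String NOT_FOUND_MARKER = "SHUFFLE_DATA_NOT_FOUND";

    /** Fetcher side: map the error message from the shuffle server to a kind. */
    static FailureKind classify(String errorMessage) {
        if (errorMessage != null && errorMessage.contains(NOT_FOUND_MARKER)) {
            return FailureKind.DATA_NOT_FOUND;
        }
        return FailureKind.TRANSIENT;
    }

    /** AM side: decide whether an upstream task should be re-run. */
    static class RerunPolicy {
        private final int transientThreshold;
        private final Map<String, Integer> transientFailures = new HashMap<>();

        RerunPolicy(int transientThreshold) {
            this.transientThreshold = transientThreshold;
        }

        boolean shouldRerun(String upstreamTaskId, FailureKind kind) {
            // A single "not found" report is enough: retrying cannot bring
            // the missing data back.
            if (kind == FailureKind.DATA_NOT_FOUND) {
                return true;
            }
            // Other failures may be transient, so count them per upstream task.
            int count = transientFailures.merge(upstreamTaskId, 1, Integer::sum);
            return count >= transientThreshold;
        }
    }

    public static void main(String[] args) {
        RerunPolicy policy = new RerunPolicy(3);
        // First transient failure for task_1: below the threshold, prints false.
        System.out.println(policy.shouldRerun("task_1", classify("connection reset")));
        // Not-found report for task_2: prints true immediately.
        System.out.println(policy.shouldRerun("task_2",
                classify(NOT_FOUND_MARKER + ": map output missing")));
    }
}
```

The point of keeping the two paths separate is that ordinary fetch failures 
still get the usual retry tolerance, while a definite "not found" skips it.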

> Tez takes a long time to recover from shuffle data not found errors
> -------------------------------------------------------------------
>
>                 Key: TEZ-4400
>                 URL: https://issues.apache.org/jira/browse/TEZ-4400
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Eric Payne
>            Priority: Minor
>
> Recently, a lot of nodes had their shuffle data wiped during an NM 
> upgrade. It took many of the Tez jobs far too long to recover. This should be 
> something that can be recovered from quickly. The NM returns an error code 
> indicating that the shuffle data was not found, and that alone is sufficient 
> evidence that no amount of retries is likely to fix the issue. As 
> soon as the NM reports shuffle data as not found, the task should report the 
> not-found error to the AM, and the AM should treat even a single not-found 
> error as sufficient cause to re-run the upstream task.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
