[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337421#comment-17337421 ]
Paul Lin commented on FLINK-22506:
----------------------------------

[~knaufk] Actually, I'm not using application mode, and this issue has been around for a very long time. I've tried 1.12.1, and the problem still exists.

[~trohrmann] I agree that it's hard to distinguish non-retryable errors from the other ones. I think a simple way to solve the problem is to mark the attempt as failed whenever an error occurs, retryable or not, and leave it to YARN to decide whether the application should be restarted. The total number of restarts would then be limited by `yarn.application-attempts` and `yarn.application-attempt-failures-validity-interval`.

> YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-22506
>                 URL: https://issues.apache.org/jira/browse/FLINK-22506
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.11.3
>            Reporter: Paul Lin
>            Priority: Major
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible)
> occurs during the initialization of the job manager, the job cluster exits with
> an error code. But since it does not mark the attempt as failed, it won't be
> counted as a failed attempt, and YARN will keep retrying forever.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
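
To illustrate the proposal in the comment above, here is a toy model of the attempt accounting, not YARN's or Flink's actual code. It assumes a restart is granted as long as the number of attempts counted as failed within the validity interval stays below the maximum. The names `Attempt`, `should_restart`, and `simulate`, and all the numbers, are hypothetical; the two parameters only stand in for `yarn.application-attempts` and `yarn.application-attempt-failures-validity-interval`.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    end_time_ms: int         # when the attempt exited
    counted_as_failed: bool  # whether the exit was recorded as a failed attempt


def should_restart(attempts, now_ms, max_attempts, validity_interval_ms):
    """Hypothetical restart rule: only failures inside the validity window count."""
    recent_failures = [
        a for a in attempts
        if a.counted_as_failed and now_ms - a.end_time_ms <= validity_interval_ms
    ]
    return len(recent_failures) < max_attempts


def simulate(mark_failed, max_attempts=2, validity_interval_ms=10_000,
             attempt_duration_ms=1_000, cap=50):
    """Run attempts that always hit the same error; return how many were started.

    `cap` is a safety limit so the "retry forever" case terminates in this sketch.
    """
    attempts = []
    now_ms = 0
    while len(attempts) < cap:
        now_ms += attempt_duration_ms
        attempts.append(Attempt(end_time_ms=now_ms, counted_as_failed=mark_failed))
        if not should_restart(attempts, now_ms, max_attempts, validity_interval_ms):
            break
    return len(attempts)


if __name__ == "__main__":
    # Behaviour reported in the issue: exits are never counted, so retries never stop.
    print("not marked as failed:", simulate(mark_failed=False))
    # Proposed behaviour: every erroneous exit is counted, so the limit applies.
    print("marked as failed:", simulate(mark_failed=True))
```

In this model, an attempt that exits without being counted never moves the application toward the limit, which matches the endless retries described in the issue. Counting every erroneous exit lets the configured maximum stop the loop, provided the failures fall within the validity interval.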