[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335225#comment-17335225 ]
Konstantin Knauf commented on FLINK-22506:
------------------------------------------
I assume you are using application mode on YARN?
Generally, it is hard for the JobManager to distinguish between a retryable
error (e.g. S3 being temporarily unavailable) and a non-retryable error like
the one you mention, so the current behavior is to retry in any case. I would
therefore move this to an "Improvement". Could you share the JobManager logs?
If possible, you could also try out Flink 1.12.
https://issues.apache.org/jira/browse/FLINK-16866 might have already changed
this behavior, but I would have to test this myself.
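
To illustrate the distinction discussed above, here is a minimal sketch in
Python. It is not Flink or YARN code: NonRetryableError, is_retryable and
run_with_attempts are hypothetical names, and the classification rules are
only an assumed heuristic. It shows why unknown errors end up being retried
and how an explicit non-retryable marker would stop the loop early.

```python
class NonRetryableError(Exception):
    """Hypothetical marker for failures that a restart cannot fix,
    e.g. a corrupted savepoint."""


def is_retryable(error):
    """Guess whether a retry could plausibly succeed (assumed heuristic)."""
    if isinstance(error, NonRetryableError):
        return False
    # Transient I/O problems, e.g. remote storage being briefly unavailable.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    # A missing or unreadable file will not fix itself between attempts.
    if isinstance(error, (FileNotFoundError, PermissionError)):
        return False
    # Unknown errors: retry, mirroring the "retry in any case" behavior.
    return True


def run_with_attempts(start, max_attempts):
    """Call start() until it succeeds, a non-retryable error occurs,
    or max_attempts is reached. Returns (result, attempts_used)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return start(), attempts
        except Exception as error:
            # Every failure counts as an attempt, so the loop always ends.
            if not is_retryable(error) or attempts >= max_attempts:
                raise
```

The important part of the sketch is that every failure increments the attempt
counter, so even errors that are wrongly classified as retryable cannot cause
an endless loop.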
> YARN job cluster stuck in retrying creating JobManager if savepoint is
> corrupted
> --------------------------------------------------------------------------------
>
> Key: FLINK-22506
> URL: https://issues.apache.org/jira/browse/FLINK-22506
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.11.3
> Reporter: Paul Lin
> Priority: Major
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible)
> occurs during the initialization of the JobManager, the job cluster exits with
> an error code. But since it does not mark the attempt as failed, the exit is
> not counted as a failed attempt, and YARN keeps retrying forever.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)