[
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337499#comment-17337499
]
Paul Lin edited comment on FLINK-22506 at 4/30/21, 4:42 PM:
------------------------------------------------------------
[~knaufk] I've attached the JM logs and a screenshot of the YARN application
web UI. Please take a look. I first reported the issue as a bug because I
think the max number of attempts (which is set to 2) is not respected in this
case, but I'm fine with making it an improvement.
> YARN job cluster stuck in retrying creating JobManager if savepoint is
> corrupted
> --------------------------------------------------------------------------------
>
> Key: FLINK-22506
> URL: https://issues.apache.org/jira/browse/FLINK-22506
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Affects Versions: 1.11.3
> Reporter: Paul Lin
> Priority: Major
> Attachments: corrupted_savepoint.log, yarn application attempts.png
>
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible)
> occurs during the initialization of the JobManager, the job cluster exits
> with an error code. But since it does not mark the attempt as failed, the
> attempt is not counted as a failed attempt, and YARN keeps retrying forever.
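
For context, the attempt limit mentioned in the comment is presumably the one
set through Flink's yarn.application-attempts option. The issue does not name
the option, so this is an assumption. A minimal flink-conf.yaml fragment with
the value from the comment might look like this:

```yaml
# flink-conf.yaml (sketch; the option name is assumed, the issue only says
# the max number of attempts "is set to 2")
yarn.application-attempts: 2
```

On the cluster side, YARN's yarn.resourcemanager.am.max-attempts setting acts
as an upper bound for this value.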