[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

Paul Lin (Jira) Fri, 30 Apr 2021 07:20:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337421#comment-17337421
 ]


Paul Lin commented on FLINK-22506:
----------------------------------

[~knaufk] Actually I'm not using the application mode, and this issue has been 
around for a very long time. I've tried 1.12.1, and the problem still exists.

[~trohrmann] I agree that it's hard to distinguish non-retryable errors from 
retryable ones. I think a simple way to solve the problem is to mark the 
attempt as failed whenever an error occurs, retryable or not, and leave it to 
YARN to decide whether the application should be restarted. The total number 
of restarts would then be bounded by `yarn.application-attempts` and 
`yarn.application-attempt-failures-validity-interval`.
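
To illustrate the idea, here is a rough sketch of how I understand the attempt 
accounting to work. It is a simplified model, not YARN's or Flink's actual code; 
the function name `should_restart` and its parameters are made up for this 
example. Only attempts that are recorded as failed, and that fall inside the 
validity interval, count against the limit.

```python
def should_restart(failed_attempt_times_ms, now_ms, max_attempts, validity_interval_ms):
    """Simplified, hypothetical model of the restart decision.

    failed_attempt_times_ms: timestamps of attempts that were marked as failed
    max_attempts:            stands in for `yarn.application-attempts`
    validity_interval_ms:    stands in for
                             `yarn.application-attempt-failures-validity-interval`;
                             a non-positive value means every failure counts
    """
    if validity_interval_ms > 0:
        # Only failures inside the validity window count against the limit.
        counted = [t for t in failed_attempt_times_ms
                   if now_ms - t <= validity_interval_ms]
    else:
        counted = list(failed_attempt_times_ms)
    return len(counted) < max_attempts


# Two failures inside the window with a limit of 2: no further restart.
print(should_restart([0, 1000], 2000, 2, 10000))
# The same failures have aged out of the window: restart is allowed again.
print(should_restart([0, 1000], 20000, 2, 10000))
# Exits that are never marked as failed leave the list empty, so the
# limit is never reached, which matches the behaviour described in this issue.
print(should_restart([], 10**9, 2, 10000))
```

Under this model, marking every erroneous exit as a failed attempt is what makes 
the two options effective as an upper bound.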

> YARN job cluster stuck in retrying creating JobManager if savepoint is 
> corrupted
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-22506
>                 URL: https://issues.apache.org/jira/browse/FLINK-22506
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.11.3
>            Reporter: Paul Lin
>            Priority: Major
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) 
> occurs during the initialization of the job manager, the job cluster exits 
> with an error code. But since it does not mark the attempt as failed, the 
> exit is not counted as a failed attempt, and YARN will keep retrying forever.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
