[ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337517#comment-17337517
 ] 

Paul Lin edited comment on FLINK-22506 at 4/30/21, 5:08 PM:
------------------------------------------------------------

[~trohrmann] Thanks a lot for the input. I now suspect that the value of 
`yarn.application-attempt-failures-validity-interval` is too low (I'm using the 
default), given that in my case a single retry can take 1 min. I'll investigate 
further and close the issue if it turns out to be a configuration problem. Thanks again! 
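
As a rough illustration of the suspicion above, the sketch below models a sliding validity window for failed attempts. It is a simplified model, not YARN's or Flink's actual implementation: the function names are hypothetical, the 10000 ms interval is assumed to be the default and should be checked against the configuration docs for your version, and the limit of 2 attempts is an arbitrary example value.

```python
def failures_in_window(failure_times_ms, now_ms, validity_interval_ms):
    """Count failures that still fall inside the validity window ending at now_ms.

    A negative interval is treated as "no window": every failure counts.
    """
    if validity_interval_ms < 0:
        return len(failure_times_ms)
    return sum(1 for t in failure_times_ms if now_ms - t <= validity_interval_ms)


def attempts_until_giving_up(retry_period_ms, validity_interval_ms,
                             max_attempts, cap=100):
    """Simulate attempts that fail once every retry_period_ms.

    Returns the number of failed attempts after which the (modelled) resource
    manager gives up, or None if it is still retrying after `cap` attempts.
    """
    failures = []
    for i in range(cap):
        now = i * retry_period_ms
        failures.append(now)  # this attempt fails at time `now`
        if failures_in_window(failures, now, validity_interval_ms) >= max_attempts:
            return len(failures)
    return None


# Retry takes 1 min, window is 10 s (assumed default): older failures always
# expire before the next one arrives, so the limit is never reached.
print(attempts_until_giving_up(60_000, 10_000, max_attempts=2))

# A window longer than the retry period lets failures accumulate.
print(attempts_until_giving_up(60_000, 120_000, max_attempts=2))
```

In this model, a retry period longer than the validity interval means at most one failure is ever inside the window, which would match the "retrying forever" behaviour; widening the window (or disabling it with a negative value) makes the attempt limit effective again.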



> YARN job cluster stuck in retrying creating JobManager if savepoint is 
> corrupted
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-22506
>                 URL: https://issues.apache.org/jira/browse/FLINK-22506
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.11.3
>            Reporter: Paul Lin
>            Priority: Major
>         Attachments: corrupted_savepoint.log, yarn application attempts.png
>
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) 
> occurs during the initialization of the job manager, the job cluster exits with 
> an error code. But since it does not mark the attempt as failed, the attempt is 
> not counted as a failed attempt, and YARN will keep retrying forever.



