[jira] [Comment Edited] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

Adrian Vasiliu (Jira) Sun, 28 Nov 2021 14:54:03 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450147#comment-17450147
 ]


Adrian Vasiliu edited comment on FLINK-22014 at 11/28/21, 10:53 PM:
--------------------------------------------------------------------

Hello [~trohrmann], [~mlushchytski], or anyone who may know:

At our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2: JobManagers in CrashLoopBackOff with similar errors 
in their logs. The storage is a ReadWriteMany PV using the rook-cephfs 
storage class.

The Flink version information from the JobManager log:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue does not occur systematically, but we have observed it in at least 3 
deployments on different OpenShift clusters. 
Reducing the number of Flink JobManagers from 3 (HA) to 1 (non-HA) avoids the 
CrashLoopBackOff.

The "Fix version(s)" field of this issue was previously assigned different 
values, but it presently shows "None". 

Has anyone else observed this with Flink 1.13.2? Should this issue be 
reopened, or should a new issue be opened?


> Flink JobManager failed to restart after failure in kubernetes HA setup
> -----------------------------------------------------------------------
>
>                 Key: FLINK-22014
>                 URL: https://issues.apache.org/jira/browse/FLINK-22014
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Mikalai Lushchytski
>            Priority: Major
>              Labels: k8s-ha, pull-request-available
>         Attachments: flink-logs.txt.zip, image-2021-04-19-11-17-58-215.png, 
> scalyr-logs (1).txt
>
>
> After the JobManager pod failed and a new one started, it was not able to 
> recover jobs because the recovery data was absent from storage: the config 
> map pointed to a non-existent file.
>   
>  As a result, the JobManager pod entered the `CrashLoopBackOff` state and 
> was not able to recover. Each attempt failed with the same error, so the 
> whole cluster became unrecoverable and stopped operating.
>   
>  I had to manually delete the config map and start the jobs again without a 
> savepoint.
>   
>  When I tried to emulate the failure again by deleting the JobManager pod 
> manually, the new pod recovered well every time, and the issue could not be 
> reproduced artificially.
>   
>  Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
> 2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}.
> 2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
> 2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
> 2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
> 2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>    at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>    at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
>    at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
>    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
>    at java.lang.Thread.run(Unknown Source) [?:?]
> Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    ... 4 more
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>    at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    ... 4 more
> Caused by: java.io.FileNotFoundException: No such file or directory: s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
>    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) ~[?:?]
>    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) ~[?:?]
>    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) ~[?:?]
>    at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699) ~[?:?]
>    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) ~[?:?]
>    at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:131) ~[?:?]
>    at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37) ~[?:?]
>    at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:125) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:162) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>    ... 4 more
> {code}
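
When comparing a failure like the one in the log above across deployments, it can help to pull out just the job id that could not be recovered and the file the state handle pointed to. The following is a minimal sketch, not part of Flink: the function name `summarize_recovery_failure` is hypothetical, and the two patterns assume the message wording seen in this log, which other Flink versions may phrase differently.

```python
import re

# Patterns taken from the wording in the log above (assumption: other
# Flink versions may phrase these messages differently).
JOB_ID_RE = re.compile(r"Could not recover job with job id ([0-9a-f]{32})")
MISSING_FILE_RE = re.compile(
    r"FileNotFoundException: No such file or directory:\s*(\S+)"
)


def summarize_recovery_failure(log_text):
    """Return (job_ids, missing_paths) found in a JobManager log excerpt.

    Hypothetical helper for triage only; it reads text and touches
    neither Kubernetes nor the HA storage.
    """
    # Mail archives quote lines with "> " and wrap long log lines, so
    # drop the quote prefix and collapse all whitespace before matching.
    unquoted = re.sub(r"^>[ \t]?", "", log_text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    job_ids = sorted(set(JOB_ID_RE.findall(flat)))
    missing_paths = sorted(set(MISSING_FILE_RE.findall(flat)))
    return job_ids, missing_paths
```

Passing the log text to `summarize_recovery_failure` returns the distinct job ids and missing paths, which can then be checked by hand against the HA storage directory and the ConfigMap entries.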



--
This message was sent by Atlassian Jira
(v8.20.1#820001)