[
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikalai Lushchytski updated FLINK-22014:
----------------------------------------
Attachment: flink-logs.txt.zip
> Flink JobManager failed to restart after failure in kubernetes HA setup
> -----------------------------------------------------------------------
>
> Key: FLINK-22014
> URL: https://issues.apache.org/jira/browse/FLINK-22014
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.11.3, 1.12.2, 1.13.0
> Reporter: Mikalai Lushchytski
> Assignee: Till Rohrmann
> Priority: Critical
> Labels: k8s-ha, pull-request-available
> Fix For: 1.11.4, 1.13.0, 1.12.3
>
> Attachments: flink-logs.txt.zip
>
>
> After the JobManager pod failed and a new one started, the new pod could not
> recover the jobs because the recovery data was missing from storage: the
> config map pointed to a file that did not exist.
>
> As a result, the JobManager pod entered the `CrashLoopBackOff` state and
> could not recover. Every restart attempt failed with the same error, so the
> whole cluster became unrecoverable and stopped operating.
>
> I had to manually delete the config map and start the jobs again without a
> savepoint.
>
> When I tried to reproduce the failure afterwards by manually deleting the
> JobManager pod, the new pod recovered correctly every time, so I could not
> reproduce the issue artificially.
>
> Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
> 2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}.
> 2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
> 2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
> 2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
> 2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
>     at java.lang.Thread.run(Unknown Source) [?:?]
> Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> Caused by: java.io.FileNotFoundException: No such file or directory: s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699) ~[?:?]
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) ~[?:?]
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:131) ~[?:?]
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37) ~[?:?]
>     at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:125) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:162) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> {code}
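
The failure mode in the report, a pointer store that still references a file no longer present in storage, can be illustrated in isolation. The following is only a minimal sketch under stated assumptions, not Flink code: `DanglingHandleCheck` and `findDanglingKeys` are hypothetical names, a plain `Map` stands in for the Kubernetes config map, and local filesystem paths stand in for the S3 recovery directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a dangling state-handle pointer; not part of Flink. */
public class DanglingHandleCheck {

    /**
     * Returns the keys whose referenced file does not exist.
     * The map is an assumed stand-in for the HA config map entries
     * (key -> location of the serialized job graph).
     */
    static List<String> findDanglingKeys(Map<String, Path> pointers) {
        List<String> dangling = new ArrayList<>();
        for (Map.Entry<String, Path> entry : pointers.entrySet()) {
            if (!Files.exists(entry.getValue())) {
                dangling.add(entry.getKey());
            }
        }
        return dangling;
    }

    public static void main(String[] args) throws Exception {
        // Local temp directory used in place of the S3 recovery path (assumption).
        Path dir = Files.createTempDirectory("recovery");
        Path present = Files.createFile(dir.resolve("submittedJobGraph-present"));
        Path missing = dir.resolve("submittedJobGraph-missing"); // never created

        Map<String, Path> pointers = new LinkedHashMap<>();
        pointers.put("jobGraph-a", present);
        pointers.put("jobGraph-b", missing);

        // Reports only the entry whose target file is absent.
        System.out.println("Dangling entries: " + findDanglingKeys(pointers));
    }
}
```

In this sketch an entry like `jobGraph-b` corresponds to the situation in the trace above, where the stored handle for `jobGraph-198c46bac791e73ebcc565a550fa4ff6` resolved to a path that no longer existed. How Flink itself should react to such an entry is a matter for the fix, not something this sketch prescribes.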
--
This message was sent by Atlassian Jira
(v8.3.4#803005)