Mikalai Lushchytski created FLINK-22014:
-------------------------------------------
Summary: Flink JobManager failed to restart after failure in
kubernetes HA setup
Key: FLINK-22014
URL: https://issues.apache.org/jira/browse/FLINK-22014
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes
Affects Versions: 1.12.2
Reporter: Mikalai Lushchytski
After the JobManager pod failed and a new one started, the new pod was not able
to recover the jobs because the recovery data was missing from storage: the
config map pointed at a file that did not exist.
Because of this, the JobManager pod entered the `CrashLoopBackOff` state and was
not able to recover. Each restart attempt failed with the same error, so the
whole cluster became unrecoverable and stopped operating.
I had to manually delete the config map and start the jobs again without a
savepoint.
When I later tried to reproduce the failure by deleting the JobManager pod
manually, the new pod recovered correctly every time, so I could not reproduce
the issue artificially.
Below is the failure log:
```
2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}.
2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
    at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    ... 4 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
    at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    ... 4 more
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) ~[?:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) ~[?:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) ~[?:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699) ~[?:?]
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) ~[?:?]
    at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:131) ~[?:?]
    at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37) ~[?:?]
    at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:125) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:162) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
    ... 4 more
```
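For triage, the affected job ID and the missing file can be pulled out of a log like the one above with a short script. This is only a rough sketch: the function name `find_missing_job_graphs` and both regular expressions are assumptions based on the message wording in this particular log, not anything provided by Flink, and they may need adjusting for other versions or file systems.

```python
import re

# Both patterns are assumptions derived from the messages in the log above.
# \s+ is used between words so that line-wrapped log text still matches.
STATE_HANDLE_RE = re.compile(
    r"from\s+state\s+handle\s+under\s+jobGraph-([0-9a-f]{32})"
)
MISSING_FILE_RE = re.compile(
    r"FileNotFoundException:\s+No\s+such\s+file\s+or\s+directory:\s+(\S+)"
)


def find_missing_job_graphs(log_text):
    """Return (job_id, missing_path) pairs found in a JobManager log.

    Each broken state handle is paired with the first FileNotFoundException
    path that appears after it in the log text.
    """
    results = []
    for handle in STATE_HANDLE_RE.finditer(log_text):
        missing = MISSING_FILE_RE.search(log_text, handle.end())
        if missing:
            results.append((handle.group(1), missing.group(1)))
    return results


# Excerpt of the failure log from this report.
sample = """
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph
from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This
indicates that the retrieved state handle is broken. Try cleaning the state
handle store.
Caused by: java.io.FileNotFoundException: No such file or directory:
s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
"""

for job_id, path in find_missing_job_graphs(sample):
    print(job_id, path)
```

The extracted path can then be compared against the bucket contents to confirm that the file referenced by the config map is really gone.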