[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318000#comment-17318000
 ] 

Till Rohrmann edited comment on FLINK-22014 at 4/9/21, 1:51 PM:
----------------------------------------------------------------

In the logs, I couldn't find anything suspicious. It simply says that Flink can 
no longer find the submitted job graph under 
{{s3a://yyyy/recovery/stellar-flink-cluster/submittedJobGraph6797768d0737}}.

However, I also found the following line in the logs, logged by a TaskManager:

{{java.io.FileNotFoundException: No such file or directory: 
s3a://yyyy/recovery/stellar-flink-cluster/blob/job_d9ded24224aab7c7041420b3efc1b6ba/blob_p-7a8321732b802055a30ecce46aa12d6d7bfc1120-2ea8cb6e2fbe05174ebeb74e141a9ab6}}

This indicates that some blobs stored in the S3 recovery directory have been 
deleted. Since I couldn't find a JobManager restart before this log line (the 
log is unfortunately not complete), it looks as if something might indeed have 
messed with the S3 storage externally. This could also explain the lost 
submitted job graph data.
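
One way to narrow this down is to probe the two paths from the logs directly. A minimal sketch on my side (not something we ship), using Flink's {{FileSystem}} API; it assumes S3 filesystem support (e.g. flink-s3-fs-hadoop) and credentials are available where it runs, and it reuses the redacted paths from the log lines above:

{code:java}
// Minimal sketch (not from the ticket): check whether the HA files referenced in the
// logs still exist in S3, via Flink's own FileSystem abstraction. Assumes an S3
// filesystem implementation (e.g. flink-s3-fs-hadoop) plus credentials are available,
// and that FLINK_CONF_DIR points at the cluster's configuration.
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class CheckRecoveryFiles {
    public static void main(String[] args) throws Exception {
        // Pick up the S3 settings from flink-conf.yaml so the s3a:// scheme can be resolved.
        FileSystem.initialize(GlobalConfiguration.loadConfiguration(), null);

        String[] paths = {
            "s3a://yyyy/recovery/stellar-flink-cluster/submittedJobGraph6797768d0737",
            "s3a://yyyy/recovery/stellar-flink-cluster/blob/job_d9ded24224aab7c7041420b3efc1b6ba/"
                + "blob_p-7a8321732b802055a30ecce46aa12d6d7bfc1120-2ea8cb6e2fbe05174ebeb74e141a9ab6"
        };
        for (String p : paths) {
            Path path = new Path(p);
            FileSystem fs = path.getFileSystem();
            System.out.println(p + " exists: " + fs.exists(path));
        }
    }
}
{code}

If both files are gone while the HA config map still references them, they must have been removed after Flink wrote them, which would fit the picture above.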

[~mlushchytski], could you double-check whether there has been some other S3 
access (deletes in particular) to the s3a://yyyy/recovery/stellar-flink-cluster 
path?
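
If versioning is enabled on the bucket, one way to see whether (and when) objects under the recovery prefix were deleted is to list the object versions and look for delete markers. A rough sketch with the AWS SDK for Java v1 (the bucket name {{yyyy}} is the redacted placeholder from the paths above):

{code:java}
// Rough sketch, not from the ticket: list delete markers under the recovery prefix.
// Assumes bucket versioning is enabled and AWS credentials are available via the
// default provider chain; "yyyy" is the redacted bucket name from the log.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3VersionSummary;
import com.amazonaws.services.s3.model.VersionListing;

public class FindDeleteMarkers {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        VersionListing listing = s3.listVersions("yyyy", "recovery/stellar-flink-cluster/");
        while (true) {
            for (S3VersionSummary v : listing.getVersionSummaries()) {
                if (v.isDeleteMarker()) {
                    System.out.println("DELETE MARKER  " + v.getKey() + "  at " + v.getLastModified());
                }
            }
            if (!listing.isTruncated()) {
                break;
            }
            // Page through the remaining versions.
            listing = s3.listNextBatchOfVersions(listing);
        }
    }
}
{code}

Delete markers dated before the failed recovery attempt would point at something other than Flink's normal cleanup.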



> Flink JobManager failed to restart after failure in kubernetes HA setup
> -----------------------------------------------------------------------
>
>                 Key: FLINK-22014
>                 URL: https://issues.apache.org/jira/browse/FLINK-22014
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Mikalai Lushchytski
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: k8s-ha, pull-request-available
>             Fix For: 1.11.4, 1.13.0, 1.12.3
>
>         Attachments: flink-logs.txt.zip
>
>
> After the JobManager pod failed and a new one started, it was not able to 
> recover the jobs because the recovery data was missing from storage - the HA 
> config map pointed at a non-existing file.
>   
>  Because of this, the JobManager pod entered the `CrashLoopBackOff` state and 
> could not recover - every attempt failed with the same error, so the whole 
> cluster became unrecoverable and stopped operating.
>   
>  I had to manually delete the config map and start the jobs again without the 
> savepoint (see the config map inspection sketch below, after the log).
>   
>  When I later tried to emulate the failure by deleting the JobManager pod 
> manually, the new pod recovered fine every time, so the issue could not be 
> reproduced artificially anymore.
>   
>  Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
> 2021-03-26 08:22:57,928 INFO  org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver{configMapName='stellar-flink-cluster-dispatcher-leader'}.
> 2021-03-26 08:22:57,931 INFO  org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
> 2021-03-26 08:22:58,029 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
> 2021-03-26 08:28:22,677 INFO  org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
> 2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
>     at java.lang.Thread.run(Unknown Source) [?:?]
> Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> Caused by: java.io.FileNotFoundException: No such file or directory: s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699) ~[?:?]
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) ~[?:?]
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:131) ~[?:?]
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37) ~[?:?]
>     at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:125) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:162) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> {code}
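>
> For the manual cleanup mentioned above, a rough sketch (not part of the original report; assumes the fabric8 kubernetes-client is available and has access to the Flink cluster's namespace) of listing which jobGraph-* entries the dispatcher-leader config map still references before deleting it:
> {code:java}
> // Rough sketch (not from the original report): list the jobGraph-* keys stored in the
> // Kubernetes HA config map before deleting it manually, to see which jobs it still references.
> // Assumes the fabric8 kubernetes-client is on the classpath with kubeconfig access to the
> // namespace; the config map name is taken from the log above, the namespace is a placeholder.
> import io.fabric8.kubernetes.api.model.ConfigMap;
> import io.fabric8.kubernetes.client.KubernetesClient;
> import io.fabric8.kubernetes.client.KubernetesClientBuilder;
> 
> public class InspectHaConfigMap {
>     public static void main(String[] args) {
>         try (KubernetesClient client = new KubernetesClientBuilder().build()) {
>             ConfigMap cm = client.configMaps()
>                     .inNamespace("default") // placeholder: use the Flink cluster's namespace
>                     .withName("stellar-flink-cluster-dispatcher-leader")
>                     .get();
>             if (cm == null || cm.getData() == null) {
>                 System.out.println("Config map not found or empty.");
>                 return;
>             }
>             cm.getData().keySet().stream()
>                     .filter(key -> key.startsWith("jobGraph-"))
>                     .forEach(key -> System.out.println("HA entry: " + key));
>         }
>     }
> }
> {code}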



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
