[
https://issues.apache.org/jira/browse/FLINK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301360#comment-17301360
]
Yang Wang commented on FLINK-21685:
-----------------------------------
[~petrizhang] I think your analysis is right. The new leader did not write the
leader information to the ConfigMap successfully. This is the root cause of why
it could not recover the jobs from the latest successful checkpoints.
It is really strange that the JobManager could still receive the "MODIFIED"
event for the leader ConfigMap, which means the leader elector could update the
annotation successfully. Maybe you could double-check the annotation of the
leader ConfigMap for verification. However, the leader information could not be
written to the content of the ConfigMap, and we also do not see any exceptions
(e.g. resource conflicts).
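For that check, a small fabric8 snippet like the one below could print the
election annotation and the data content of the leader ConfigMap side by side.
This is just a sketch: the namespace and the ConfigMap name are assumptions
about your deployment, and "control-plane.alpha.kubernetes.io/leader" is the
lock annotation that the fabric8 leader elector maintains on the ConfigMap.
{code:java}
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class LeaderConfigMapInspector {
    public static void main(String[] args) {
        // Assumptions: adjust the namespace and ConfigMap name to match your
        // deployment ("kubectl get configmaps" will list the HA ConfigMaps).
        String namespace = "default";
        String configMapName = "my-cluster-dispatcher-leader"; // hypothetical name

        try (KubernetesClient client = new DefaultKubernetesClient()) {
            ConfigMap cm = client.configMaps()
                    .inNamespace(namespace)
                    .withName(configMapName)
                    .get();
            if (cm == null) {
                System.out.println("ConfigMap not found: " + configMapName);
                return;
            }

            // The fabric8 leader elector keeps its lock record in this
            // annotation; if it is present and fresh, the election succeeded.
            System.out.println("leader annotation: "
                    + cm.getMetadata().getAnnotations()
                          .get("control-plane.alpha.kubernetes.io/leader"));

            // The leader address/session id should appear in the data section;
            // an empty map here with a valid annotation would match the
            // symptom described above.
            System.out.println("data: " + cm.getData());
        }
    }
}
{code}
If the annotation keeps being renewed while the data stays empty, that would
confirm the split behavior described above.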
I am curious which version of Kubernetes you are using. Is it a standard
Kubernetes cluster, or one with some internal changes?
I have tested in minikube and on a real K8s cluster with version 1.18. It works
well, and I could not reproduce this issue.
> Flink JobManager failed to restart from checkpoint in kubernetes HA setup
> -------------------------------------------------------------------------
>
> Key: FLINK-21685
> URL: https://issues.apache.org/jira/browse/FLINK-21685
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.12.1, 1.12.2
> Reporter: Peng Zhang
> Priority: Major
> Attachments: 01-role.yaml, 02-role-binding.yaml, 03-config.yaml,
> 06-jobmanager-deployment.yaml, 08-taskmanager-deployment.yaml, flink-ha.log,
> scalyr-logs.txt.zip
>
>
> We use a Flink K8s session cluster in HA mode (1 JobManager and 4
> TaskManagers). When jobs were running in Flink and the JobManager restarted,
> the JobManager failed to recover the jobs from their checkpoints:
> {code}
> 2021-03-08 13:16:42,962 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Trying to retrieve checkpoint 1.
> 2021-03-08 13:16:43,014 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring
> job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for
> 9a534b2e309b24f78866b65d94082ead located at
> s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
>
> 2021-03-08 13:16:43,023 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master
> state to restore
> 2021-03-08 13:16:43,024 INFO org.apache.flink.runtime.jobmaster.JobMaster
> [] - Using failover strategy
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2
> for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
> 2021-03-08 13:16:43,046 INFO
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager
> runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead)
> was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74
> at akka.tcp://[email protected]:6123/user/rpc/jobmanager_2.
> 2021-03-08 13:16:43,060 WARN akka.remote.transport.netty.NettyTransport
> [] - Remote connection to [null] failed with
> java.net.NoRouteToHostException: No route to host
> 2021-03-08 13:16:43,060 WARN akka.remote.ReliableDeliverySupervisor
> [] - Association with remote system
> [akka.tcp://[email protected]:6123] has failed, address is now gated for
> [50] ms. Reason: [Association failed with
> [akka.tcp://[email protected]:6123]] Caused by:
> [java.net.NoRouteToHostException: No route to host]
> {code}
> Attached are the log and our configuration.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)