[jira] [Commented] (FLINK-21685) Flink JobManager failed to restart from checkpoint in kubernetes HA setup

Peng Zhang (Jira) Tue, 09 Mar 2021 23:38:13 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298595#comment-17298595
 ]


Peng Zhang commented on FLINK-21685:
------------------------------------

[~fly_in_gis] we are running in a production K8S cluster, so a network issue 
between the JobManager pod and the Kubernetes APIServer is unlikely. Could you 
try to reproduce this in a real K8S cluster rather than in minikube, since the 
K8S behaviour might be different there?

 

And when you said that you could reproduce the issue, could you explain in more 
detail whether there is an actual issue and, if so, how you worked around it?

 

 

> Flink JobManager failed to restart from checkpoint in kubernetes HA setup
> -------------------------------------------------------------------------
>
>                 Key: FLINK-21685
>                 URL: https://issues.apache.org/jira/browse/FLINK-21685
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.12.1, 1.12.2
>            Reporter: Peng Zhang
>            Priority: Major
>         Attachments: 03-config.yaml, 06-jobmanager-deployment.yaml, 
> 08-taskmanager-deployment.yaml, flink-ha.log
>
>
> We use a Flink K8S session cluster in HA mode (1 JobManager and 4 
> TaskManagers). When the JobManager restarted while jobs were running, it 
> failed to recover the jobs from the checkpoint:
>  
> {{2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1.
> 2021-03-08 13:16:43,014 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 9a534b2e309b24f78866b65d94082ead located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
> 2021-03-08 13:16:43,023 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master state to restore
> 2021-03-08 13:16:43,024 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
> 2021-03-08 13:16:43,046 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://[email protected]:6123/user/rpc/jobmanager_2.
> 2021-03-08 13:16:43,060 WARN  akka.remote.transport.netty.NettyTransport                   [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
> 2021-03-08 13:16:43,060 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://[email protected]:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] Caused by: [java.net.NoRouteToHostException: No route to host] }}
>  
> Attached are the log and our configuration.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
