[jira] [Updated] (FLINK-21685) Flink JobManager failed to restart from checkpoint in kubernetes HA setup

Till Rohrmann (Jira) Wed, 10 Mar 2021 00:13:05 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann updated FLINK-21685:
----------------------------------
    Description: 
We run a Flink session cluster on Kubernetes in HA mode (1 JobManager and 4 
TaskManagers). When the JobManager restarted while jobs were running, it 
failed to recover the job from its checkpoint:


{code}
2021-03-08 13:16:42,962 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying 
to fetch 1 checkpoints from storage. 
2021-03-08 13:16:42,962 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying 
to fetch 1 checkpoints from storage. 
2021-03-08 13:16:42,962 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying 
to retrieve checkpoint 1. 
2021-03-08 13:16:43,014 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 
9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 
9a534b2e309b24f78866b65d94082ead located at 
s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
 
2021-03-08 13:16:43,023 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
state to restore 
2021-03-08 13:16:43,024 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Using failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2
 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead). 
2021-03-08 13:16:43,046 INFO  
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - JobManager 
runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) 
was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at 
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2. 
2021-03-08 13:16:43,060 WARN  akka.remote.transport.netty.NettyTransport        
           [] - Remote connection to [null] failed with 
java.net.NoRouteToHostException: No route to host 
2021-03-08 13:16:43,060 WARN  akka.remote.ReliableDeliverySupervisor            
           [] - Association with remote system 
[akka.tcp://[email protected]:6123] has failed, address is now gated for [50] 
ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] 
Caused by: [java.net.NoRouteToHostException: No route to host]
{code}


The log and our configuration are attached.

 

  was:
We run a Flink session cluster on Kubernetes in HA mode (1 JobManager and 4 
TaskManagers). When the JobManager restarted while jobs were running, it 
failed to recover the job from its checkpoint:

 

{{2021-03-08 13:16:42,962 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying 
to fetch 1 checkpoints from storage. 2021-03-08 13:16:42,962 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying 
to fetch 1 checkpoints from storage. 2021-03-08 13:16:42,962 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying 
to retrieve checkpoint 1. 2021-03-08 13:16:43,014 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 
9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 
9a534b2e309b24f78866b65d94082ead located at 
s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
 2021-03-08 13:16:43,023 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
state to restore 2021-03-08 13:16:43,024 INFO  
org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using 
failover strategy 
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2
 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead). 2021-03-08 
13:16:43,046 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      
[] - JobManager runner for job BrandCollectionTrackingJob 
(9a534b2e309b24f78866b65d94082ead) was granted leadership with session id 
c258d8ce-69d3-49df-8bee-1b748d5bbe74 at 
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2. 2021-03-08 
13:16:43,060 WARN  akka.remote.transport.netty.NettyTransport                   
[] - Remote connection to [null] failed with java.net.NoRouteToHostException: 
No route to host 2021-03-08 13:16:43,060 WARN  
akka.remote.ReliableDeliverySupervisor                       [] - Association 
with remote system [akka.tcp://[email protected]:6123] has failed, address is 
now gated for [50] ms. Reason: [Association failed with 
[akka.tcp://[email protected]:6123]] Caused by: 
[java.net.NoRouteToHostException: No route to host] }}

 

The log and our configuration are attached.

 


> Flink JobManager failed to restart from checkpoint in kubernetes HA setup
> -------------------------------------------------------------------------
>
>                 Key: FLINK-21685
>                 URL: https://issues.apache.org/jira/browse/FLINK-21685
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.12.1, 1.12.2
>            Reporter: Peng Zhang
>            Priority: Major
>         Attachments: 03-config.yaml, 06-jobmanager-deployment.yaml, 
> 08-taskmanager-deployment.yaml, flink-ha.log
>
>
> We run a Flink session cluster on Kubernetes in HA mode (1 JobManager and 4 
> TaskManagers). When the JobManager restarted while jobs were running, it 
> failed to recover the job from its checkpoint:
> {code}
> 2021-03-08 13:16:42,962 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - 
> Trying to fetch 1 checkpoints from storage. 
> 2021-03-08 13:16:42,962 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - 
> Trying to fetch 1 checkpoints from storage. 
> 2021-03-08 13:16:42,962 INFO  
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - 
> Trying to retrieve checkpoint 1. 
> 2021-03-08 13:16:43,014 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring 
> job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 
> 9a534b2e309b24f78866b65d94082ead located at 
> s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
>  
> 2021-03-08 13:16:43,023 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
> state to restore 
> 2021-03-08 13:16:43,024 INFO  org.apache.flink.runtime.jobmaster.JobMaster    
>              [] - Using failover strategy 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2
>  for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead). 
> 2021-03-08 13:16:43,046 INFO  
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - JobManager 
> runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) 
> was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 
> at akka.tcp://[email protected]:6123/user/rpc/jobmanager_2. 
> 2021-03-08 13:16:43,060 WARN  akka.remote.transport.netty.NettyTransport      
>              [] - Remote connection to [null] failed with 
> java.net.NoRouteToHostException: No route to host 
> 2021-03-08 13:16:43,060 WARN  akka.remote.ReliableDeliverySupervisor          
>              [] - Association with remote system 
> [akka.tcp://[email protected]:6123] has failed, address is now gated for 
> [50] ms. Reason: [Association failed with 
> [akka.tcp://[email protected]:6123]] Caused by: 
> [java.net.NoRouteToHostException: No route to host]
> {code}
> The log and our configuration are attached.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
