[
https://issues.apache.org/jira/browse/FLINK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301360#comment-17301360
]
Yang Wang commented on FLINK-21685:
-----------------------------------
[~petrizhang] I think your analysis is right. The new leader did not write the
leader information to the ConfigMap successfully. This is the root cause of why
it could not recover the jobs from the latest successful checkpoints.
It is really strange that the JobManager could still receive the "MODIFIED"
event for the leader ConfigMap, which means the leader elector could update the
annotation successfully. Maybe you could double-check the annotation of the
leader ConfigMap for verification. However, the leader information could not be
written to the content of the ConfigMap, and we also do not see any exceptions
(e.g. resource conflicts).
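For that check, a small fabric8 snippet like the one below could print the
election annotation and the data content of the leader ConfigMap side by side.
This is just a sketch: the namespace and the ConfigMap name are assumptions
about your deployment, and "control-plane.alpha.kubernetes.io/leader" is the
lock annotation that the fabric8 leader elector maintains on the ConfigMap.
{code:java}
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class LeaderConfigMapInspector {
    public static void main(String[] args) {
        // Assumptions: adjust the namespace and ConfigMap name to match your
        // deployment ("kubectl get configmaps" will list the HA ConfigMaps).
        String namespace = "default";
        String configMapName = "my-cluster-dispatcher-leader"; // hypothetical name

        try (KubernetesClient client = new DefaultKubernetesClient()) {
            ConfigMap cm = client.configMaps()
                    .inNamespace(namespace)
                    .withName(configMapName)
                    .get();
            if (cm == null) {
                System.out.println("ConfigMap not found: " + configMapName);
                return;
            }

            // The fabric8 leader elector keeps its lock record in this
            // annotation; if it is present and fresh, the election succeeded.
            System.out.println("leader annotation: "
                    + cm.getMetadata().getAnnotations()
                          .get("control-plane.alpha.kubernetes.io/leader"));

            // The leader address/session id should appear in the data section;
            // an empty map here with a valid annotation would match the
            // symptom described above.
            System.out.println("data: " + cm.getData());
        }
    }
}
{code}
If the annotation keeps being renewed while the data stays empty, that would
confirm the split behavior described above.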
I am curious which version of Kubernetes you are using. Is it a standard
Kubernetes cluster, or one with some internal changes?
I have tested in minikube and on a real K8s cluster with version 1.18. It works
well, and I could not reproduce this issue.
> Flink JobManager failed to restart from checkpoint in kubernetes HA setup
> -------------------------------------------------------------------------
>
> Key: FLINK-21685
> URL: https://issues.apache.org/jira/browse/FLINK-21685
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.12.1, 1.12.2
> Reporter: Peng Zhang
> Priority: Major
> Attachments: 01-role.yaml, 02-role-binding.yaml, 03-config.yaml,
> 06-jobmanager-deployment.yaml, 08-taskmanager-deployment.yaml, flink-ha.log,
> scalyr-logs.txt.zip
>
>
> We use a Flink K8s session cluster in HA mode (1 JobManager and 4
> TaskManagers). When jobs were running in Flink and the JobManager restarted,
> the JobManager failed to recover the jobs from their checkpoints:
> {code}
> 2021-03-08 13:16:42,962 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] -
> Trying to retrieve checkpoint 1.
> 2021-03-08 13:16:43,014 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring
> job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for
> 9a534b2e309b24f78866b65d94082ead located at
> s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
>
> 2021-03-08 13:16:43,023 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master
> state to restore
> 2021-03-08 13:16:43,024 INFO org.apache.flink.runtime.jobmaster.JobMaster
> [] - Using failover strategy
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2
> for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
> 2021-03-08 13:16:43,046 INFO
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager
> runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead)
> was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74
> at akka.tcp://[email protected]:6123/user/rpc/jobmanager_2.
> 2021-03-08 13:16:43,060 WARN akka.remote.transport.netty.NettyTransport
> [] - Remote connection to [null] failed with
> java.net.NoRouteToHostException: No route to host
> 2021-03-08 13:16:43,060 WARN akka.remote.ReliableDeliverySupervisor
> [] - Association with remote system
> [akka.tcp://[email protected]:6123] has failed, address is now gated for
> [50] ms. Reason: [Association failed with
> [akka.tcp://[email protected]:6123]] Caused by:
> [java.net.NoRouteToHostException: No route to host]
> {code}
> Attached are the log and our configuration.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)