[ https://issues.apache.org/jira/browse/FLINK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299440#comment-17299440 ]
Peng Zhang commented on FLINK-21685:
------------------------------------
[~trohrmann] We have a company-wide K8s cluster where we have set up Flink
with HA. We run one Flink JobManager and 4 TaskManagers in session cluster
mode, and we submit jobs to Flink via the REST API. When no jobs are running,
Flink recovers fine after a JobManager restart. However, when jobs are running
and I delete the JobManager pod with `kubectl delete pod <jobmanager-pod-id>`,
a new JobManager pod is started, but Flink cannot recover properly.
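For reference, here is a minimal sketch of the Kubernetes HA keys we set (the
cluster-id is inferred from our ConfigMap names; the storageDir bucket is a
placeholder):
{code}
# Minimal sketch of our Kubernetes HA settings, appended to flink-conf.yaml
# when building the image. The cluster-id is inferred from our ConfigMap
# names; the storageDir bucket is a placeholder:
cat >> flink-conf.yaml <<'EOF'
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.cluster-id: stellar-flink-cluster
high-availability.storageDir: s3a://<bucket>/recovery
EOF
{code}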
I found that the `stellar-flink-cluster-resourcemanager-leader` ConfigMap is
not updated with the IP address of the new JobManager pod, so the TaskManagers
cannot find the new JobManager. However, it is unclear why the leader record
is not updated to point at the new JobManager.
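This is how I checked it (a sketch; the `component=jobmanager` label is an
assumption from our deployment manifests):
{code}
# Dump the leader ConfigMap; the recorded leader address should point at the
# new JobManager pod, but in our case it still holds the old address:
kubectl get configmap stellar-flink-cluster-resourcemanager-leader -o yaml

# Compare the recorded address with the IP of the new JobManager pod
# (the component=jobmanager label is an assumption from our manifests):
kubectl get pods -l component=jobmanager -o wide
{code}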
I saw this issue
[https://stackoverflow.com/questions/66219093/flink-fencing-errors-in-k8-ha-mode/66228073#66228073]
and tried configuring via `FLINK_PROPERTIES` instead of using a ConfigMap for
flink-conf.yaml, but this does not help.
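Roughly, this is what the `FLINK_PROPERTIES` attempt looked like (a sketch;
the deployment name is an assumption and the values are placeholders):
{code}
# Sketch: pass the config through FLINK_PROPERTIES (read by the official
# Flink image's docker-entrypoint.sh) instead of mounting flink-conf.yaml.
# The deployment name and values below are placeholders from our setup:
kubectl set env deployment/flink-jobmanager FLINK_PROPERTIES='
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.cluster-id: stellar-flink-cluster
high-availability.storageDir: s3a://<bucket>/recovery'
{code}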
[~fly_in_gis] I am trying your example and will let you know the result.
Meanwhile, your example starts Flink as `standalone-job`, whereas we start
Flink in standalone session mode. I wonder if this matters. Would you be able
to start Flink in standalone session mode, submit a job, and then delete the
JobManager pod (which will start a new JobManager), and see if there is an
issue? Our repro steps are sketched below. Thanks!
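For reference, a sketch of the repro on our side (host and jar id are
placeholders; 8081 is the default REST port):
{code}
# 1. Submit a job to the session cluster through the REST API
#    (host and jar id are placeholders; 8081 is the default REST port):
curl -X POST -F "jarfile=@job.jar" http://<jobmanager-host>:8081/jars/upload
curl -X POST http://<jobmanager-host>:8081/jars/<jar-id>/run

# 2. Kill the JobManager pod; the Deployment starts a replacement:
kubectl delete pod <jobmanager-pod-id>

# 3. Watch whether the job recovers on the new JobManager:
kubectl get pods -w
{code}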
> Flink JobManager failed to restart from checkpoint in kubernetes HA setup
> -------------------------------------------------------------------------
>
> Key: FLINK-21685
> URL: https://issues.apache.org/jira/browse/FLINK-21685
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.12.1, 1.12.2
> Reporter: Peng Zhang
> Priority: Major
> Attachments: 03-config.yaml, 06-jobmanager-deployment.yaml,
> 08-taskmanager-deployment.yaml, flink-ha.log
>
>
> We use a Flink K8s session cluster with HA mode (1 JobManager and 4
> TaskManagers). When jobs are running and the JobManager is restarted, the
> JobManager fails to recover the jobs from the checkpoint:
> {code}
> 2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
> 2021-03-08 13:16:42,962 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1.
> 2021-03-08 13:16:43,014 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job 9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for 9a534b2e309b24f78866b65d94082ead located at s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
> 2021-03-08 13:16:43,023 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore
> 2021-03-08 13:16:43,024 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2 for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
> 2021-03-08 13:16:43,046 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead) was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at akka.tcp://[email protected]:6123/user/rpc/jobmanager_2.
> 2021-03-08 13:16:43,060 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
> 2021-03-08 13:16:43,060 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
> {code}
> Attached are the log and our configuration.
>