[
https://issues.apache.org/jira/browse/FLINK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann updated FLINK-21685:
----------------------------------
Description:
We use a Flink K8S session cluster in HA mode (1 JobManager and 4
TaskManagers). When the JobManager restarted while jobs were running, it
failed to recover the job from its checkpoint:
{code}
2021-03-08 13:16:42,962 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying
to fetch 1 checkpoints from storage.
2021-03-08 13:16:42,962 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying
to fetch 1 checkpoints from storage.
2021-03-08 13:16:42,962 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying
to retrieve checkpoint 1.
2021-03-08 13:16:43,014 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job
9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for
9a534b2e309b24f78866b65d94082ead located at
s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
2021-03-08 13:16:43,023 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master
state to restore
2021-03-08 13:16:43,024 INFO org.apache.flink.runtime.jobmaster.JobMaster
[] - Using failover strategy
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2
for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead).
2021-03-08 13:16:43,046 INFO
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager
runner for job BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead)
was granted leadership with session id c258d8ce-69d3-49df-8bee-1b748d5bbe74 at
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2.
2021-03-08 13:16:43,060 WARN akka.remote.transport.netty.NettyTransport
[] - Remote connection to [null] failed with
java.net.NoRouteToHostException: No route to host
2021-03-08 13:16:43,060 WARN akka.remote.ReliableDeliverySupervisor
[] - Association with remote system
[akka.tcp://[email protected]:6123] has failed, address is now gated for [50]
ms. Reason: [Association failed with [akka.tcp://[email protected]:6123]]
Caused by: [java.net.NoRouteToHostException: No route to host]
{code}
The log and our configuration are attached.
was:
We use a Flink K8S session cluster in HA mode (1 JobManager and 4
TaskManagers). When the JobManager restarted while jobs were running, it
failed to recover the job from its checkpoint:
{{2021-03-08 13:16:42,962 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying
to fetch 1 checkpoints from storage. 2021-03-08 13:16:42,962 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying
to fetch 1 checkpoints from storage. 2021-03-08 13:16:42,962 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying
to retrieve checkpoint 1. 2021-03-08 13:16:43,014 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job
9a534b2e309b24f78866b65d94082ead from Checkpoint 1 @ 1615208258041 for
9a534b2e309b24f78866b65d94082ead located at
s3a://zalando-stellar-flink-state-eu-central-1-staging/checkpoints/9a534b2e309b24f78866b65d94082ead/chk-1.
2021-03-08 13:16:43,023 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master
state to restore 2021-03-08 13:16:43,024 INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Using
failover strategy
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@58d927d2
for BrandCollectionTrackingJob (9a534b2e309b24f78866b65d94082ead). 2021-03-08
13:16:43,046 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl
[] - JobManager runner for job BrandCollectionTrackingJob
(9a534b2e309b24f78866b65d94082ead) was granted leadership with session id
c258d8ce-69d3-49df-8bee-1b748d5bbe74 at
akka.tcp://[email protected]:6123/user/rpc/jobmanager_2. 2021-03-08
13:16:43,060 WARN akka.remote.transport.netty.NettyTransport
[] - Remote connection to [null] failed with java.net.NoRouteToHostException:
No route to host 2021-03-08 13:16:43,060 WARN
akka.remote.ReliableDeliverySupervisor [] - Association
with remote system [akka.tcp://[email protected]:6123] has failed, address is
now gated for [50] ms. Reason: [Association failed with
[akka.tcp://[email protected]:6123]] Caused by:
[java.net.NoRouteToHostException: No route to host] }}
Attached is the log, and our configuration.
> Flink JobManager failed to restart from checkpoint in kubernetes HA setup
> -------------------------------------------------------------------------
>
> Key: FLINK-21685
> URL: https://issues.apache.org/jira/browse/FLINK-21685
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.12.1, 1.12.2
> Reporter: Peng Zhang
> Priority: Major
> Attachments: 03-config.yaml, 06-jobmanager-deployment.yaml,
> 08-taskmanager-deployment.yaml, flink-ha.log
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)