[
https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874795#comment-17874795
]
袁枫 commented on FLINK-33481:
----------------------------
Has this been fixed? [~hansonhe]
> Why were checkpoints stored on zookeeper deleted when JobManager failures
> with Flink High Availability on yarn
> --------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-33481
> URL: https://issues.apache.org/jira/browse/FLINK-33481
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.13.1
> Reporter: hansonhe
> Priority: Major
> Attachments: image-2023-11-08-09-40-59-889.png,
> image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
>
> FlinkVersion: 1.13.1
> (1) flink-conf.yaml
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum xxxxx
> state.checkpoint-storage filesystem
> state.checkpoints.dir hdfs://xxxxx
> (2) jobmanager
> application_1684323088373_1744
> jm_1: appattempt_1684323088373_1744_000001 Tue Oct 31 11:19:07 +0800 2023
> jm_2: appattempt_1684323088373_1744_000002 Sat Nov 4 11:10:52 +0800 2023
> (3) When appattempt_1684323088373_1744_000001 failed, I found:
> 3.1) Checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c completed
> and was stored on HDFS successfully.
> 3.2) The checkpoint stored in ZooKeeper under
> /flink/application_1684323088373_1744 was deleted.
> The logs are as follows:
> !image-2023-11-08-10-05-54-694.png!
> !image-2023-11-08-09-40-59-889.png!
> (4) After appattempt_1684323088373_1744_000001 failed, the JobManager
> switched to appattempt_1684323088373_1744_000002, whose logs begin as
> follows: No checkpoint found during restore.
> !image-2023-11-08-09-57-17-739.png!
> (5) My questions:
> 5.1) Why were the checkpoints stored in ZooKeeper deleted when the
> JobManager failed with Flink High Availability on YARN? As a result, the
> new JobManager tried to restore and found no checkpoint.
> 5.2) Why was the successfully completed checkpoint 5750 stored on HDFS not
> used directly for the restore after failing over to
> jm_2: appattempt_1684323088373_1744_000002? It still attempted to recover
> from ZooKeeperStateHandleStore first.
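
On question 5.2: the following is a minimal sketch, not Flink's own recovery logic, of how the newest completed checkpoint could be located by hand. It assumes that the checkpoint files are still present on HDFS, that `paths` holds the path column of a recursive listing of the job's checkpoint directory (for example from `hdfs dfs -ls -R`), and that completed checkpoints follow the usual `chk-<n>/_metadata` layout. The function name `latest_checkpoint` and the base path in the usage lines are hypothetical.

```python
import re
from typing import Iterable, Optional

# A checkpoint directory is treated as complete only if it has a _metadata file.
_CHK_METADATA = re.compile(r"/chk-(\d+)/_metadata$")


def latest_checkpoint(paths: Iterable[str]) -> Optional[str]:
    """Return the chk-<n> directory with the highest n that has _metadata.

    Returns None when no completed checkpoint appears in the listing.
    """
    best_id = -1
    best_dir = None
    for path in paths:
        match = _CHK_METADATA.search(path)
        if match is None:
            continue  # shared/taskowned files or a checkpoint without metadata
        checkpoint_id = int(match.group(1))
        if checkpoint_id > best_id:
            best_id = checkpoint_id
            best_dir = path[: -len("/_metadata")]
    return best_dir


if __name__ == "__main__":
    # Hypothetical base path; the real state.checkpoints.dir is elided above.
    base = "hdfs:///hypothetical/checkpoints/6262e8c6a072027459f9b4eeb3e9735c"
    listing = [
        f"{base}/chk-5749/_metadata",
        f"{base}/chk-5750/_metadata",
        f"{base}/shared/some-state-file",
    ]
    print(latest_checkpoint(listing))
```

If such a directory exists, it could be passed to `flink run -s <checkpoint-dir> ...` to resume the job from that checkpoint manually.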
--
This message was sent by Atlassian Jira
(v8.20.10#820010)