[ https://issues.apache.org/jira/browse/FLINK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jeremyMu updated FLINK-38133:
-----------------------------
    Summary: Unable to find checkpoint info during Flink jm pod restart on k8s, running mode is native Kubernetes, jobmanager created by deployment  (was: Unable to find checkpoint status analysis during Flink jm pod restart on k8s, running mode is native Kubernetes, jobmanager created by deployment)

> Unable to find checkpoint info during Flink jm pod restart on k8s, running
> mode is native Kubernetes, jobmanager created by deployment
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-38133
>                 URL: https://issues.apache.org/jira/browse/FLINK-38133
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.16.2
>            Reporter: jeremyMu
>            Priority: Major
>        Attachments: 486dc1fed2cf5b84922ede34d479015-1.png,
> AgAABj35qkdRNkkxGZlEc4xVnl4UNi2l.png, 微信图片_20250722214623.png
>
>
> Before exiting abnormally, the JobManager clears the HA metadata (such as
> the checkpoint pointers): the cleanup path in the source code deletes the
> HA ConfigMap. Because the previous abnormal exit deleted the ConfigMap, the
> restarted JobManager cannot locate the HA ConfigMap and therefore cannot
> find the checkpoint information.
>
> In real deployments the number of restart attempts is bounded (in some
> business scenarios the TaskManager will not retry indefinitely). If the job
> reaches the retry limit and cannot be brought up normally, the JobManager
> exits. On that exit, the metadata stored by HA is cleaned up (see the
> cleanup logic in the source code), so when the JobManager pod (created by a
> Deployment) is automatically restarted, it cannot find the HA metadata and
> thus cannot locate the most recent checkpoint state.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
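The failure mode described above presupposes Kubernetes HA together with a bounded restart strategy; a minimal flink-conf.yaml sketch of such a setup (the attempt count, delay, and storage path are illustrative values, not taken from the report):

```yaml
# Kubernetes HA: job-state pointers are stored in ConfigMaps in the
# cluster's namespace, while the checkpoint data itself lives under
# high-availability.storageDir.
high-availability: kubernetes
high-availability.storageDir: s3://flink-ha/recovery   # illustrative path

# Bounded restart strategy: once the attempts are exhausted, the job
# reaches a terminal FAILED state and the JobManager cleans up the HA
# ConfigMaps on shutdown -- which is exactly when the pointer to the
# latest checkpoint is lost for the next pod restart.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

With this configuration, once the fixed-delay attempts are used up the job terminates, and the Deployment-managed JobManager pod that Kubernetes restarts afterwards finds no HA ConfigMap to recover from.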