[ https://issues.apache.org/jira/browse/FLINK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jeremyMu updated FLINK-38133:
-----------------------------
    Description: 
Before exiting abnormally, the JM clears the HA metadata (such as checkpoint pointers); in the source code, it deletes the HA ConfigMap.

Because the previous abnormal exit of the JM deleted the ConfigMap, the next restart could not locate the HA ConfigMap and, consequently, could not find the checkpoint information.

In production, the number of TM retries is configured (in some business scenarios, the TaskManager does not retry indefinitely). If the TM reaches the retry limit and fails to bring the job up normally, the JM crashes. After the JM crashes, the metadata stored by HA is cleared (see the logic in the source code). As a result, when the JM automatically restarts, it cannot find the HA metadata and therefore cannot locate the most recent checkpoint state.

  was:
Before exiting abnormally, the JM clears the HA metadata (such as checkpoint pointers); in the source code, it deletes the HA ConfigMap.

In production, the number of TM retries is configured (in some business scenarios, the TaskManager does not retry indefinitely). If the TM reaches the retry limit and fails to bring the job up normally, the JM crashes. After the JM crashes, the metadata stored by HA is cleared (see the logic in the source code).
As a result, when the JM automatically restarts, it cannot find the HA metadata and therefore cannot locate the most recent checkpoint state.


> Unable to find checkpoint status analysis during Flink restart on k8s, jobmanager created by deployment
> ------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-38133
>                 URL: https://issues.apache.org/jira/browse/FLINK-38133
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.16.2
>            Reporter: jeremyMu
>            Priority: Major
>         Attachments: 486dc1fed2cf5b84922ede34d479015-1.png, AgAABj35qkdRNkkxGZlEc4xVnl4UNi2l.png, 微信图片_20250722214623.png
>
>
> Before exiting abnormally, the JM clears the HA metadata (such as checkpoint
> pointers); in the source code, it deletes the HA ConfigMap.
> Because the previous abnormal exit of the JM deleted the ConfigMap, the next
> restart could not locate the HA ConfigMap and, consequently, could not find
> the checkpoint information.
> In production, the number of TM retries is configured (in some business
> scenarios, the TaskManager does not retry indefinitely). If the TM reaches the
> retry limit and fails to bring the job up normally, the JM crashes. After the
> JM crashes, the metadata stored by HA is cleared (see the logic in the source
> code). As a result, when the JM automatically restarts, it cannot find the HA
> metadata and therefore cannot locate the most recent checkpoint state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)