[ 
https://issues.apache.org/jira/browse/FLINK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeremyMu updated FLINK-38133:
-----------------------------
    Description: 
Before the JobManager (JM) exits abnormally, it clears the HA metadata 
(for example, the checkpoint pointers).

The source code deletes the HA ConfigMap.

Because the JM's previous abnormal exit deleted the ConfigMap, the next 
restart cannot locate the HA ConfigMap and therefore cannot find the 
checkpoint information.

In production, the number of TaskManager (TM) retries is limited (in some 
business scenarios the TaskManager must not retry indefinitely). If the TM 
reaches the retry limit without bringing the job up, the JM crashes. When 
the JM crashes, the metadata stored by HA is cleared (see the logic in the 
source code). As a result, when the JM restarts automatically, it cannot 
find the HA metadata and thus cannot locate the most recent checkpoint.



  was:
Before the JobManager (JM) exits abnormally, it clears the HA metadata 
(for example, the checkpoint pointers).

The source code deletes the HA ConfigMap.

In production, the number of TaskManager (TM) retries is limited (in some 
business scenarios the TaskManager must not retry indefinitely). If the TM 
reaches the retry limit without bringing the job up, the JM crashes. When 
the JM crashes, the metadata stored by HA is cleared (see the logic in the 
source code). As a result, when the JM restarts automatically, it cannot 
find the HA metadata and thus cannot locate the most recent checkpoint.




> Unable to find checkpoint status analysis during Flink restart on 
> k8s,jobmanager created by deployment
> ------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-38133
>                 URL: https://issues.apache.org/jira/browse/FLINK-38133
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.16.2
>            Reporter: jeremyMu
>            Priority: Major
>         Attachments: 486dc1fed2cf5b84922ede34d479015-1.png, 
> AgAABj35qkdRNkkxGZlEc4xVnl4UNi2l.png, 微信图片_20250722214623.png
>
>
> Before the JobManager (JM) exits abnormally, it clears the HA metadata 
> (for example, the checkpoint pointers).
> The source code deletes the HA ConfigMap.
> Because the JM's previous abnormal exit deleted the ConfigMap, the next 
> restart cannot locate the HA ConfigMap and therefore cannot find the 
> checkpoint information.
> In production, the number of TaskManager (TM) retries is limited (in some 
> business scenarios the TaskManager must not retry indefinitely). If the TM 
> reaches the retry limit without bringing the job up, the JM crashes. When 
> the JM crashes, the metadata stored by HA is cleared (see the logic in the 
> source code). As a result, when the JM restarts automatically, it cannot 
> find the HA metadata and thus cannot locate the most recent checkpoint.
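
The sequence described in the report can be sketched as a small toy model. 
This is not Flink code: HaStore, run_job_manager, simulate, the 
cleanup_on_failure flag, and the checkpoint path are all hypothetical names 
chosen for illustration, and the model simply assumes that every attempt to 
start the job fails. It only shows why a JM that clears its HA entry on exit 
leaves nothing for the restarted JM to recover from.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical checkpoint pointer, used only for this illustration.
CHECKPOINT = "s3://example-bucket/checkpoints/chk-42"


@dataclass
class HaStore:
    """Stand-in for the HA ConfigMap: job id -> latest checkpoint pointer."""
    entries: Dict[str, str] = field(default_factory=dict)

    def latest_checkpoint(self, job_id: str) -> Optional[str]:
        return self.entries.get(job_id)


def run_job_manager(store: HaStore, job_id: str, max_attempts: int,
                    cleanup_on_failure: bool) -> Optional[str]:
    """Simulate one JM lifetime and return the pointer it recovered at startup."""
    recovered = store.latest_checkpoint(job_id)
    attempts = 0
    while attempts < max_attempts:
        attempts += 1  # assumption: every attempt to bring the job up fails
    # Retry limit reached: the JM exits. With cleanup enabled, the HA entry
    # is removed, which is the behavior the report describes.
    if cleanup_on_failure:
        store.entries.pop(job_id, None)
    return recovered


def simulate(cleanup_on_failure: bool) -> Tuple[Optional[str], Optional[str]]:
    """Run a JM, let it crash, restart it, and report what each run recovered."""
    store = HaStore({"job-1": CHECKPOINT})
    first = run_job_manager(store, "job-1", 3, cleanup_on_failure)
    second = run_job_manager(store, "job-1", 3, cleanup_on_failure)
    return first, second
```

Calling simulate(cleanup_on_failure=True) models the reported behavior: the 
first JM sees the pointer, the restarted JM sees nothing. With 
cleanup_on_failure=False both runs see the same pointer, which is the outcome 
the reporter expects after an automatic restart.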



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
