[ https://issues.apache.org/jira/browse/FLINK-38133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
jeremyMu updated FLINK-38133:
-----------------------------
    Summary: Unable to find checkpoint info during Flink jm pod restart on k8s, running mode is native Kubernetes, jobmanager created by deployment  (was: Unable to find checkpoint status analysis during Flink jm pod restart on k8s, running mode is native Kubernetes, jobmanager created by deployment)

> Unable to find checkpoint info during Flink jm pod restart on k8s, running
> mode is native Kubernetes, jobmanager created by deployment
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-38133
>                 URL: https://issues.apache.org/jira/browse/FLINK-38133
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.16.2
>            Reporter: jeremyMu
>            Priority: Major
>        Attachments: 486dc1fed2cf5b84922ede34d479015-1.png,
> AgAABj35qkdRNkkxGZlEc4xVnl4UNi2l.png, 微信图片_20250722214623.png
>
>
> Before exiting abnormally, the JobManager clears the HA metadata (such as
> the checkpoint pointers): the cleanup path in the source code deletes the
> HA ConfigMap. Because the previous abnormal exit deleted the ConfigMap, the
> restarted JobManager cannot locate the HA ConfigMap and therefore cannot
> find the checkpoint information.
>
> In real deployments the number of restart attempts is bounded (in some
> business scenarios the TaskManager will not retry indefinitely). If the job
> reaches the retry limit and cannot be brought up normally, the JobManager
> exits. On that exit, the metadata stored by HA is cleaned up (see the
> cleanup logic in the source code), so when the JobManager pod (created by a
> Deployment) is automatically restarted, it cannot find the HA metadata and
> thus cannot locate the most recent checkpoint state.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
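The failure mode described above presupposes Kubernetes HA together with a bounded restart strategy; a minimal flink-conf.yaml sketch of such a setup (the attempt count, delay, and storage path are illustrative values, not taken from the report):

```yaml
# Kubernetes HA: job-state pointers are stored in ConfigMaps in the
# cluster's namespace, while the checkpoint data itself lives under
# high-availability.storageDir.
high-availability: kubernetes
high-availability.storageDir: s3://flink-ha/recovery   # illustrative path

# Bounded restart strategy: once the attempts are exhausted, the job
# reaches a terminal FAILED state and the JobManager cleans up the HA
# ConfigMaps on shutdown -- which is exactly when the pointer to the
# latest checkpoint is lost for the next pod restart.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

With this configuration, once the fixed-delay attempts are used up the job terminates, and the Deployment-managed JobManager pod that Kubernetes restarts afterwards finds no HA ConfigMap to recover from.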