[jira] [Commented] (FLINK-37483) Native kubernetes clusters losing checkpoint state on FAILED

Matthias Pohl (Jira) Wed, 23 Jul 2025 08:45:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18009309#comment-18009309
 ]


Matthias Pohl commented on FLINK-37483:
---------------------------------------

Ah ok, it took me a bit to get this right. Essentially, Flink is behaving as 
expected for the cleanup: the job is submitted, an error occurs during 
initialization, and that error triggers the global cleanup (which removes all 
the state that was generated during job submission).

Have you looked into the parameters that were introduced with FLINK-25715, in 
combination with 
[job-result-store.delete-on-commit|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#job-result-store-delete-on-commit],
to handle failures in Application Mode?
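
As a rough illustration of the suggestion above, the settings could be passed as `-Dkey=value` dynamic properties when deploying the application cluster. This is only a sketch under assumptions: `job-result-store.delete-on-commit` is the option linked above, while `execution.submit-failed-job-on-application-error` is assumed here to be the parameter that came with FLINK-25715 (please check the configuration docs for your version). The helper name and the chosen values are hypothetical and would need to be validated against the actual setup.

```python
# Hypothetical helper: renders Flink settings as -Dkey=value dynamic properties.
def to_dynamic_properties(settings):
    rendered = []
    for key, value in settings.items():
        # Flink expects lowercase booleans ("true"/"false"), not Python's "True"/"False".
        text = str(value).lower() if isinstance(value, bool) else str(value)
        rendered.append(f"-D{key}={text}")
    return rendered


# Assumed values, to be verified: keep job result store entries after commit,
# and submit a failed job when the application fails (assumed FLINK-25715 option).
settings = {
    "job-result-store.delete-on-commit": False,
    "execution.submit-failed-job-on-application-error": True,
}

args = to_dynamic_properties(settings)
print(" ".join(args))
```

The resulting arguments would be appended to whatever deployment command is already in use; the same two keys could equally be set in the Flink configuration file.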

> Native kubernetes clusters losing checkpoint state on FAILED
> ------------------------------------------------------------
>
>                 Key: FLINK-37483
>                 URL: https://issues.apache.org/jira/browse/FLINK-37483
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.20.1
>            Reporter: Max Feng
>            Priority: Major
>
> We're running Flink 1.20, native kubernetes application-mode clusters, and 
> we're running into an issue where clusters are restarting without checkpoints 
> from HA configmaps.
> To the best of our understanding, here's what's happening: 
> 1) We're running application-mode clusters in native kubernetes with 
> externalized checkpoints, retained on cancellation. We're attempting to 
> restore a job from a checkpoint; the checkpoint reference is held in the 
> Kubernetes HA configmap. 
> 2) The JobManager encounters an issue during startup, and the job goes to 
> state FAILED.
> 3) The HA configmap containing the checkpoint reference is cleaned up.
> 4) The Kubernetes pod exits. Because it is a Kubernetes deployment, the pod 
> is immediately restarted. 
> 5) Upon restart, the new JobManager finds no checkpoints to restore from.
> We think this is a bad combination of the following behaviors:
> * FAILED triggers cleanup, which cleans up HA configmaps in native kubernetes 
> mode
> * FAILED does not actually stop a job in native kubernetes mode, instead it 
> is immediately retried


