[jira] [Commented] (FLINK-37483) Native kubernetes clusters losing checkpoint state on FAILED

Matthias Pohl (Jira) Tue, 18 Mar 2025 02:47:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936434#comment-17936434
 ]


Matthias Pohl commented on FLINK-37483:
---------------------------------------

Hi [~maxfeng], are you able to provide some logs? The scenario you're 
describing shouldn't happen: errors that occur while the JobManagerRunner is 
being created should only result in a local cleanup (no HA data is touched). 
Here's the related code: 
https://github.com/apache/flink/blob/7ef73148981621e815c77f164f49e2be0065662c/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L803

Looking into the logs of your failed run might help identify the issue.
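
To make the local vs. global cleanup distinction concrete, here is a minimal 
toy model. It is not Flink code: the names CleanupSketch, Outcome, Scope, and 
HaStore are hypothetical. It assumes that only the globally terminal outcomes 
(FAILED, CANCELED, FINISHED) remove the checkpoint reference from HA storage, 
while a failure during runner creation is cleaned up locally.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only; none of these types are Flink classes. */
public class CleanupSketch {

    /** Hypothetical outcomes of a single job attempt. */
    enum Outcome { RUNNER_CREATION_FAILED, SUSPENDED, FAILED, CANCELED, FINISHED }

    /** Hypothetical cleanup scopes. */
    enum Scope { LOCAL, GLOBAL }

    /** Stand-in for the HA ConfigMap: job id -> checkpoint reference. */
    static final class HaStore {
        private final Map<String, String> checkpointRefs = new HashMap<>();

        void put(String jobId, String checkpointRef) {
            checkpointRefs.put(jobId, checkpointRef);
        }

        boolean hasCheckpoint(String jobId) {
            return checkpointRefs.containsKey(jobId);
        }

        void remove(String jobId) {
            checkpointRefs.remove(jobId);
        }
    }

    /** Assumption: only globally terminal outcomes trigger a global cleanup. */
    static Scope scopeFor(Outcome outcome) {
        switch (outcome) {
            case FAILED:
            case CANCELED:
            case FINISHED:
                return Scope.GLOBAL;
            default:
                return Scope.LOCAL;
        }
    }

    /** A local cleanup leaves HA data alone; a global cleanup removes it. */
    static void cleanUp(String jobId, Outcome outcome, HaStore ha) {
        if (scopeFor(outcome) == Scope.GLOBAL) {
            ha.remove(jobId);
        }
    }

    public static void main(String[] args) {
        HaStore ha = new HaStore();
        ha.put("job-1", "chk-42");

        cleanUp("job-1", Outcome.RUNNER_CREATION_FAILED, ha);
        System.out.println("after runner creation failure: " + ha.hasCheckpoint("job-1"));

        cleanUp("job-1", Outcome.FAILED, ha);
        System.out.println("after FAILED: " + ha.hasCheckpoint("job-1"));
    }
}
```

In this model, the reported symptom (a missing checkpoint reference after a 
restart) would only appear on the global path, which is why the logs of the 
failed run are needed to see which path was actually taken.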

> Native kubernetes clusters losing checkpoint state on FAILED
> ------------------------------------------------------------
>
>                 Key: FLINK-37483
>                 URL: https://issues.apache.org/jira/browse/FLINK-37483
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.20.1
>            Reporter: Max Feng
>            Priority: Major
>
> We're running Flink 1.20, native kubernetes application-mode clusters, and 
> we're running into an issue where clusters are restarting without checkpoints 
> from HA configmaps.
> To the best of our understanding, here's what's happening: 
> 1) We're running application-mode clusters in native kubernetes with 
> externalized checkpoints, retained on cancellation. We're attempting to 
> restore a job from a checkpoint; the checkpoint reference is held in the 
> Kubernetes HA configmap. 
> 2) The JobManager encounters an issue during startup, and the job goes to 
> state FAILED.
> 3) The HA configmap containing the checkpoint reference is cleaned up.
> 4) The Kubernetes pod exits. Because it is a Kubernetes Deployment, the pod 
> is immediately restarted. 
> 5) Upon restart, the new JobManager finds no checkpoints to restore from.
> We think this is a bad combination of the following behaviors:
> * FAILED triggers cleanup, which cleans up HA configmaps in native kubernetes 
> mode
> * FAILED does not actually stop a job in native kubernetes mode, instead it 
> is immediately retried



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
