[ https://issues.apache.org/jira/browse/FLINK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936434#comment-17936434 ]
Matthias Pohl commented on FLINK-37483:
---------------------------------------

Hi [~maxfeng], are you able to provide some logs? The scenario you're describing shouldn't actually happen because errors while the JobManagerRunner is created should only result in a local cleanup (no HA data is touched). Here's the [related code|https://github.com/apache/flink/blob/7ef73148981621e815c77f164f49e2be0065662c/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L803]. Looking into the logs of your failed run might help identify the issue.

> Native kubernetes clusters losing checkpoint state on FAILED
> ------------------------------------------------------------
>
>                 Key: FLINK-37483
>                 URL: https://issues.apache.org/jira/browse/FLINK-37483
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.20.1
>            Reporter: Max Feng
>            Priority: Major
>
> We're running Flink 1.20, native kubernetes application-mode clusters, and
> we're running into an issue where clusters are restarting without checkpoints
> from HA configmaps.
> To the best of our understanding, here's what's happening:
> 1) We're running application-mode clusters in native kubernetes with
> externalized checkpoints, retained on cancellation. We're attempting to
> restore a job from a checkpoint; the checkpoint reference is held in the
> Kubernetes HA configmap.
> 2) The jobmanager encounters an issue during startup, and the job goes to
> state FAILED.
> 3) The HA configmap containing the checkpoint reference is cleaned up.
> 4) The Kubernetes pod exits. Because it is a Kubernetes deployment, the pod
> is immediately restarted.
> 5) Upon restart, the new Jobmanager finds no checkpoints to restore from.
> We think this is a bad combination of the following behaviors:
> * FAILED triggers cleanup, which cleans up HA configmaps in native kubernetes
> mode
> * FAILED does not actually stop a job in native kubernetes mode, instead it
> is immediately retried



--
This message was sent by Atlassian Jira
(v8.20.10#820010)