[ https://issues.apache.org/jira/browse/FLINK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936547#comment-17936547 ]
Max Feng commented on FLINK-37483:
----------------------------------

Here's the trace of the failure. I'll need to reproduce it again to get full logs from the attempt.

{code:java}
Job 00000000000000000000000000000000 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
	at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773)
	at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:67)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state ad8761465be643c10db5fae153b87f68
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
	... 4 more
Caused by: java.lang.IllegalStateException: There is no operator for the state ad8761465be643c10db5fae153b87f68
	at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.checkStateMappingCompleteness(StateAssignmentOperation.java:769)
	at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignStates(StateAssignmentOperation.java:101)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1829)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1749)
	at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:210)
	at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:382)
	at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:225)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:142)
	at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:162)
	at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:121)
	at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:406)
	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:383)
	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:128)
	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:100)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
	... 4 more
{code}

> Native kubernetes clusters losing checkpoint state on FAILED
> ------------------------------------------------------------
>
>                 Key: FLINK-37483
>                 URL: https://issues.apache.org/jira/browse/FLINK-37483
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.20.1
>            Reporter: Max Feng
>            Priority: Major
>
> We're running Flink 1.20, native kubernetes application-mode clusters, and
> we're running into an issue where clusters are restarting without checkpoints
> from HA configmaps.
>
> To the best of our understanding, here's what's happening:
> 1) We're running application-mode clusters in native kubernetes with
> externalized checkpoints, retained on cancellation. We're attempting to
> restore a job from a checkpoint; the checkpoint reference is held in the
> Kubernetes HA configmap.
> 2) The jobmanager encounters an issue during startup, and the job goes to
> state FAILED.
> 3) The HA configmap containing the checkpoint reference is cleaned up.
> 4) The Kubernetes pod exits. Because it is a Kubernetes deployment, the pod
> is immediately restarted.
> 5) Upon restart, the new jobmanager finds no checkpoints to restore from.
>
> We think this is a bad combination of the following behaviors:
> * FAILED triggers cleanup, which cleans up HA configmaps in native kubernetes
> mode
> * FAILED does not actually stop a job in native kubernetes mode; instead it
> is immediately retried



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
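
For readers trying to reproduce the setup described in the report, a minimal configuration sketch follows. This is not the reporter's actual configuration; the storage paths are placeholders, and the option names are the standard Flink 1.20 keys for Kubernetes HA and externalized checkpoint retention.

{code:yaml}
# Hypothetical flink-conf.yaml fragment illustrating the setup in the report.
# Paths are placeholders, not the reporter's real values.

# Kubernetes HA: the latest checkpoint reference is tracked in a ConfigMap.
high-availability.type: kubernetes
high-availability.storageDir: s3://example-bucket/flink/ha

# Externalized checkpoints, retained when the job is cancelled.
# Note the gap at issue: retention on *cancellation* does not stop the HA
# ConfigMap from being cleaned up when the job reaches terminal state FAILED.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
state.checkpoints.dir: s3://example-bucket/flink/checkpoints
{code}

Under this configuration, the checkpoint data itself survives in the checkpoint directory, but once the ConfigMap holding the reference is deleted in step 3, the restarted jobmanager in step 5 has no pointer to it.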