[ https://issues.apache.org/jira/browse/FLINK-37483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936547#comment-17936547 ]
Max Feng commented on FLINK-37483:
----------------------------------

Here's the trace of the failure. I'll need to reproduce it again to get full logs from the attempt.

{code:java}
Job 00000000000000000000000000000000 reached terminal state FAILED.
org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
	at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1773)
	at org.apache.flink.util.MdcUtils.lambda$wrapRunnable$1(MdcUtils.java:67)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: There is no operator for the state ad8761465be643c10db5fae153b87f68
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
	... 4 more
Caused by: java.lang.IllegalStateException: There is no operator for the state ad8761465be643c10db5fae153b87f68
	at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.checkStateMappingCompleteness(StateAssignmentOperation.java:769)
	at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignStates(StateAssignmentOperation.java:101)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1829)
	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1749)
	at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:210)
	at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:382)
	at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:225)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:142)
	at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:162)
	at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:121)
	at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:406)
	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:383)
	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:128)
	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:100)
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
	... 4 more
{code}

> Native kubernetes clusters losing checkpoint state on FAILED
> ------------------------------------------------------------
>
>                 Key: FLINK-37483
>                 URL: https://issues.apache.org/jira/browse/FLINK-37483
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.20.1
>            Reporter: Max Feng
>            Priority: Major
>
> We're running Flink 1.20, native kubernetes application-mode clusters, and
> we're running into an issue where clusters are restarting without checkpoints
> from HA configmaps.
>
> To the best of our understanding, here's what's happening:
> 1) We're running application-mode clusters in native kubernetes with
> externalized checkpoints, retained on cancellation. We're attempting to
> restore a job from a checkpoint; the checkpoint reference is held in the
> Kubernetes HA configmap.
> 2) The jobmanager encounters an issue during startup, and the job goes to
> state FAILED.
> 3) The HA configmap containing the checkpoint reference is cleaned up.
> 4) The Kubernetes pod exits. Because it is a Kubernetes deployment, the pod
> is immediately restarted.
> 5) Upon restart, the new jobmanager finds no checkpoints to restore from.
>
> We think this is a bad combination of the following behaviors:
> * FAILED triggers cleanup, which cleans up HA configmaps in native kubernetes
> mode
> * FAILED does not actually stop a job in native kubernetes mode; instead it
> is immediately retried



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
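
For readers trying to reproduce the setup described in the report, a minimal configuration sketch follows. This is not the reporter's actual configuration; the storage paths are placeholders, and the option names are the standard Flink 1.20 keys for Kubernetes HA and externalized checkpoint retention.

{code:yaml}
# Hypothetical flink-conf.yaml fragment illustrating the setup in the report.
# Paths are placeholders, not the reporter's real values.

# Kubernetes HA: the latest checkpoint reference is tracked in a ConfigMap.
high-availability.type: kubernetes
high-availability.storageDir: s3://example-bucket/flink/ha

# Externalized checkpoints, retained when the job is cancelled.
# Note the gap at issue: retention on *cancellation* does not stop the HA
# ConfigMap from being cleaned up when the job reaches terminal state FAILED.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
state.checkpoints.dir: s3://example-bucket/flink/checkpoints
{code}

Under this configuration, the checkpoint data itself survives in the checkpoint directory, but once the ConfigMap holding the reference is deleted in step 3, the restarted jobmanager in step 5 has no pointer to it.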