Chesnay Schepler created FLINK-35159: ----------------------------------------
Summary: CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash Key: FLINK-35159 URL: https://issues.apache.org/jira/browse/FLINK-35159 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.18.0 Reporter: Chesnay Schepler Assignee: Chesnay Schepler Fix For: 1.18.2, 1.20.0, 1.19.1 When a task manager dies while the JM is generating an ExecutionGraph in the background then {{CreatingExecutionGraph#handleExecutionGraphCreation}} can transition back into WaitingForResources if the TM hosted one of the slots that we planned to use in {{tryToAssignSlots}}. At this point the ExecutionGraph was already transitioned to running, which implicitly kicks of periodic checkpointing by the CheckpointCoordinator, without the operator coordinator holders being initialized yet (as this happens after we assigned slots). This effectively leaks that CheckpointCoordinator, including the timer thread that will continue to try triggering checkpoints, which will naturally fail to trigger. This can cause a JM crash because it results in {{OperatorCoordinatorHolder#abortCurrentTriggering}} to be called, which fails with an NPE since the {{mainThreadExecutor}} was not initialized yet. {code} java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707) at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610) at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910) at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932) at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ... 7 more Caused by: java.lang.NullPointerException at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388) at java.base/java.util.ArrayList.forEach(ArrayList.java:1541) at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ... 8 more {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)