[ https://issues.apache.org/jira/browse/FLINK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chesnay Schepler closed FLINK-35159.
------------------------------------
    Resolution: Fixed

> CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash
> ------------------------------------------------------------------------
>
>                 Key: FLINK-35159
>                 URL: https://issues.apache.org/jira/browse/FLINK-35159
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0
>            Reporter: Chesnay Schepler
>            Assignee: Chesnay Schepler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> When a task manager dies while the JM is generating an ExecutionGraph in the
> background, {{CreatingExecutionGraph#handleExecutionGraphCreation}} can
> transition back into WaitingForResources if the TM hosted one of the slots
> that we planned to use in {{tryToAssignSlots}}.
> At this point the ExecutionGraph has already been transitioned to running,
> which implicitly kicks off periodic checkpointing by the
> CheckpointCoordinator, without the operator coordinator holders being
> initialized yet (as this happens after we have assigned slots).
> This effectively leaks that CheckpointCoordinator, including the timer thread,
> which will keep trying to trigger checkpoints; these attempts will naturally
> fail.
> This can cause a JM crash because it causes
> {{OperatorCoordinatorHolder#abortCurrentTriggering}} to be called, which
> fails with an NPE since the {{mainThreadExecutor}} has not been initialized
> yet.
> {code}
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
>     at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
>     at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
>     at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
>     at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
>     at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>     at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>     at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
>     at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>     ... 7 more
> Caused by: java.lang.NullPointerException
>     at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
>     at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
>     at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
>     at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>     ... 8 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
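
For readers unfamiliar with the failure mode in the report, the ordering
problem can be reduced to a small standalone sketch. This is not Flink code:
LeakedTimerSketch, Holder and triggerOnce are hypothetical names, and only the
method names lazyInitialize and abortCurrentTriggering are chosen to echo the
report. It assumes, as the description states, that the executor field is set
only during a later initialization step, so a timer thread that fires before
that step hits a null field.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical reduction of the report's ordering problem; not Flink's classes. */
public class LeakedTimerSketch {

    /** Stand-in for a holder whose executor is assigned only after slot assignment. */
    static final class Holder {
        private Executor mainThreadExecutor; // null until lazyInitialize is called

        void lazyInitialize(Executor executor) {
            this.mainThreadExecutor = executor;
        }

        void abortCurrentTriggering() {
            // Throws NullPointerException if lazyInitialize has not run yet.
            mainThreadExecutor.execute(() -> { });
        }
    }

    /**
     * Runs a single abort attempt on a timer thread, the way a periodic
     * trigger would. Returns the failure cause, or null if it succeeded.
     */
    static Throwable triggerOnce(Holder holder) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            Runnable attempt = holder::abortCurrentTriggering;
            ScheduledFuture<?> result = timer.schedule(attempt, 10, TimeUnit.MILLISECONDS);
            result.get(); // rethrows the task's failure wrapped in ExecutionException
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            // Stopping the timer is the step a leaked coordinator never reaches.
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Holder holder = new Holder();
        System.out.println("before init: " + triggerOnce(holder));

        holder.lazyInitialize(Runnable::run); // run tasks directly on the caller
        System.out.println("after init:  " + triggerOnce(holder));
    }
}
```

The sketch only shows why the trigger fails when it runs ahead of
initialization. It does not reproduce the state transition back to
WaitingForResources, and it makes no claim about how the actual fix for this
issue was implemented.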