Chesnay Schepler created FLINK-35159:
----------------------------------------
Summary: CreatingExecutionGraph can leak CheckpointCoordinator and
cause JM crash
Key: FLINK-35159
URL: https://issues.apache.org/jira/browse/FLINK-35159
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.18.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
Fix For: 1.18.2, 1.20.0, 1.19.1
When a task manager dies while the JM is generating an ExecutionGraph in the
background, {{CreatingExecutionGraph#handleExecutionGraphCreation}} can
transition back into WaitingForResources if the TM hosted one of the slots that
we planned to use in {{tryToAssignSlots}}.
At this point the ExecutionGraph has already been transitioned to running, which
implicitly kicks off periodic checkpointing by the CheckpointCoordinator
before the operator coordinator holders have been initialized (this happens
only after slots are assigned).
This effectively leaks that CheckpointCoordinator, including its timer thread,
which keeps trying to trigger checkpoints that will inevitably fail.
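The leak can be illustrated in isolation with a minimal sketch. It is not Flink code: {{PeriodicTrigger}} and {{attemptsAfterAbandon}} are hypothetical names standing in for a component that owns a periodic timer. The point it shows is that a scheduled task keeps firing after its owner has moved on, unless the timer is shut down explicitly.
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LeakedTimerSketch {

    /** Hypothetical stand-in for a component that owns a periodic trigger timer. */
    static final class PeriodicTrigger {
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        final AtomicInteger attempts = new AtomicInteger();

        void start() {
            // Each tick models one attempt to trigger a checkpoint.
            timer.scheduleAtFixedRate(
                    attempts::incrementAndGet, 0, 10, TimeUnit.MILLISECONDS);
        }

        void shutdown() {
            timer.shutdownNow();
        }
    }

    /**
     * Starts the trigger, then models the owner moving on without calling
     * shutdown(). Returns the number of attempts made after that point.
     */
    static int attemptsAfterAbandon() throws InterruptedException {
        PeriodicTrigger trigger = new PeriodicTrigger();
        trigger.start();

        // The owner "transitions away" here; nothing stops the timer.
        int before = trigger.attempts.get();
        Thread.sleep(100);
        int after = trigger.attempts.get();

        // Cleanup only so that this sketch terminates.
        trigger.shutdown();
        return after - before;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("attempts after abandon: " + attemptsAfterAbandon());
    }
}
{code}
The sketch keeps a reference to the trigger only so it can clean up. In the scenario described here no such cleanup happens, so the timer thread keeps running.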
This can crash the JM, because the failed trigger attempt causes
{{OperatorCoordinatorHolder#abortCurrentTriggering}} to be called, which fails
with an NPE since the {{mainThreadExecutor}} has not been initialized yet.
{code}
java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
    at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
    at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
    at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
    ... 7 more
Caused by: java.lang.NullPointerException
    at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
    at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
    ... 8 more
{code}
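The shape of the NPE can be reproduced with a small, self-contained sketch. It is a simplified model and not Flink's {{OperatorCoordinatorHolder}}: the {{Holder}} class, its {{initialize}} method and the helper methods are hypothetical, and only the names {{mainThreadExecutor}} and {{abortCurrentTriggering}} are borrowed from the description above. It assumes the executor field is assigned late and dereferenced without a null check.
{code:java}
import java.util.concurrent.Executor;

public class UninitializedExecutorSketch {

    /** Hypothetical, simplified holder whose executor is assigned late. */
    static final class Holder {
        private Executor mainThreadExecutor; // null until initialize() runs

        void initialize(Executor executor) {
            this.mainThreadExecutor = executor;
        }

        void abortCurrentTriggering() {
            // Throws NullPointerException if initialize() has not run yet.
            mainThreadExecutor.execute(() -> { });
        }
    }

    /** Models the timer firing before the holder was initialized. */
    static boolean abortFailsBeforeInit() {
        Holder holder = new Holder();
        try {
            holder.abortCurrentTriggering();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Models the intended order: initialize first, then abort. */
    static boolean abortSucceedsAfterInit() {
        Holder holder = new Holder();
        holder.initialize(Runnable::run); // runs the task on the calling thread
        holder.abortCurrentTriggering();
        return true;
    }

    public static void main(String[] args) {
        System.out.println("fails before init: " + abortFailsBeforeInit());
        System.out.println("succeeds after init: " + abortSucceedsAfterInit());
    }
}
{code}
In the sketch the exception is caught by the caller. In the trace above it surfaces on the timer thread inside a {{CompletableFuture}} callback, wrapped in {{CompletionException}}.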