[
https://issues.apache.org/jira/browse/FLINK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-35159:
-----------------------------------
Labels: pull-request-available (was: )
> CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash
> ------------------------------------------------------------------------
>
> Key: FLINK-35159
> URL: https://issues.apache.org/jira/browse/FLINK-35159
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.0
> Reporter: Chesnay Schepler
> Assignee: Chesnay Schepler
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> When a task manager dies while the JM is generating an ExecutionGraph in the
> background, {{CreatingExecutionGraph#handleExecutionGraphCreation}} can
> transition back into WaitingForResources if the TM hosted one of the slots
> that we planned to use in {{tryToAssignSlots}}.
> At this point the ExecutionGraph has already transitioned to running, which
> implicitly kicks off periodic checkpointing by the CheckpointCoordinator,
> without the operator coordinator holders being initialized yet (as this
> happens after we assigned slots).
> This effectively leaks that CheckpointCoordinator, including its timer thread,
> which keeps trying to trigger checkpoints. These attempts naturally fail.
> This can cause a JM crash because each failed attempt results in
> {{OperatorCoordinatorHolder#abortCurrentTriggering}} being called, which
> fails with an NPE since the {{mainThreadExecutor}} was not initialized yet.
> {code}
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
> at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
> at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
> at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
> at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
> at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
> at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
> at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
> at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
> at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
> at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
> at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
> ... 7 more
> Caused by: java.lang.NullPointerException
> at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
> at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
> at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
> ... 8 more
> {code}
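
The failure pattern in the description (a periodic timer that starts firing before a lazily initialized executor has been set) can be reproduced in isolation. The sketch below is not Flink code and makes no claim about how the issue was fixed. `LeakedTimerSketch`, `Holder`, `initialize`, and `runUninitialized` are hypothetical names chosen for illustration, and the scheduled task merely stands in for the checkpoint trigger failure path.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class LeakedTimerSketch {

    /** Hypothetical stand-in for a holder whose executor is only set after slot assignment. */
    static final class Holder {
        private Executor mainThreadExecutor; // stays null until initialize() is called

        void initialize(Executor executor) {
            this.mainThreadExecutor = executor;
        }

        void abortCurrentTriggering() {
            // Dereferences the executor; throws NullPointerException if initialize() never ran.
            mainThreadExecutor.execute(() -> { });
        }
    }

    /**
     * Starts a periodic task before the holder is initialized, mimicking a timer that
     * outlives the state that was supposed to finish the setup.
     * Returns the first failure observed inside the timer task, or null if none occurred.
     */
    static Throwable runUninitialized() throws InterruptedException {
        Holder holder = new Holder(); // initialize() is deliberately never called
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        CountDownLatch failed = new CountDownLatch(1);
        try {
            ScheduledFuture<?> periodicTrigger = timer.scheduleAtFixedRate(() -> {
                try {
                    holder.abortCurrentTriggering();
                } catch (Throwable t) {
                    // Record only the first failure and wake up the waiting caller.
                    failure.compareAndSet(null, t);
                    failed.countDown();
                }
            }, 0, 10, TimeUnit.MILLISECONDS);
            failed.await(5, TimeUnit.SECONDS);
            periodicTrigger.cancel(false);
        } finally {
            // Without this shutdown the timer thread would keep running, i.e. it would leak.
            timer.shutdownNow();
        }
        return failure.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timer task failed with: " + runUninitialized());
    }
}
```

In this sketch the exception is caught inside the timer task so that it can be inspected. In the trace above it instead propagates out of the CompletableFuture chain, which is what the description says can crash the JM.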