[jira] [Created] (FLINK-35159) CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash

Chesnay Schepler (Jira) Thu, 18 Apr 2024 06:13:10 -0700

Chesnay Schepler created FLINK-35159:
----------------------------------------


             Summary: CreatingExecutionGraph can leak CheckpointCoordinator and 
cause JM crash
                 Key: FLINK-35159
                 URL: https://issues.apache.org/jira/browse/FLINK-35159
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.18.0
            Reporter: Chesnay Schepler
            Assignee: Chesnay Schepler
             Fix For: 1.18.2, 1.20.0, 1.19.1


When a task manager dies while the JM is generating an ExecutionGraph in the 
background then {{CreatingExecutionGraph#handleExecutionGraphCreation}} can 
transition back into WaitingForResources if the TM hosted one of the slots that 
we planned to use in {{tryToAssignSlots}}.

At this point the ExecutionGraph was already transitioned to running, which 
implicitly kicks of periodic checkpointing by the CheckpointCoordinator, 
without the operator coordinator holders being initialized yet (as this happens 
after we assigned slots).

This effectively leaks that CheckpointCoordinator, including the timer thread 
that will continue to try triggering checkpoints, which will naturally fail to 
trigger.
This can cause a JM crash because it results in 
{{OperatorCoordinatorHolder#abortCurrentTriggering}} to be called, which fails 
with an NPE since the {{mainThreadExecutor}} was not initialized yet.

{code}
java.util.concurrent.CompletionException: 
java.util.concurrent.CompletionException: java.lang.NullPointerException
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
        at 
java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
        at 
java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
        at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
        at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
        at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
        at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CompletionException: 
java.lang.NullPointerException
        at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
        at 
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
        at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
        at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
        ... 7 more
Caused by: java.lang.NullPointerException
        at 
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
        at 
java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
        at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
        ... 8 more
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-35159) CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash

Reply via email to