Matthias created FLINK-24789:
--------------------------------
Summary: IllegalStateException with CheckpointCleaner being closed
already
Key: FLINK-24789
URL: https://issues.apache.org/jira/browse/FLINK-24789
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing, Runtime / Coordination
Affects Versions: 1.14.1
Reporter: Matthias
We experienced a failure of {{OperatorCoordinatorSchedulerTest}} in our VVP
fork of Flink. The {{finegrained_resource_management}} test run failed with a
non-zero exit code:
{code}
Nov 01 17:19:12 [ERROR] Failed to execute goal
org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on
project flink-runtime: There are test failures.
Nov 01 17:19:12 [ERROR]
Nov 01 17:19:12 [ERROR] Please refer to
/__w/1/s/flink-runtime/target/surefire-reports for the individual test results.
Nov 01 17:19:12 [ERROR] Please refer to dump files (if any exist) [date].dump,
[date]-jvmRun[N].dump and [date].dumpstream.
Nov 01 17:19:12 [ERROR] ExecutionException The forked VM terminated without
properly saying goodbye. VM crash or System.exit called?
Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime &&
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m
-Dmvn.forkNumber=2 -XX:+UseG1GC -jar
/__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar
/__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2
surefire6448660128033443499tmp surefire_4131168043975619749001tmp
Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
Nov 01 17:19:12 [ERROR] Process Exit Code: 239
Nov 01 17:19:12 [ERROR] Crashed tests:
Nov 01 17:19:12 [ERROR]
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
Nov 01 17:19:12 [ERROR]
org.apache.maven.surefire.booter.SurefireBooterForkException:
ExecutionException The forked VM terminated without properly saying goodbye. VM
crash or System.exit called?
Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime &&
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m
-Dmvn.forkNumber=2 -XX:+UseG1GC -jar
/__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar
/__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2
surefire6448660128033443499tmp surefire_4131168043975619749001tmp
Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
Nov 01 17:19:12 [ERROR] Process Exit Code: 239
Nov 01 17:19:12 [ERROR] Crashed tests:
Nov 01 17:19:12 [ERROR]
org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
Nov 01 17:19:12 [ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
Nov 01 17:19:12 [ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:457)
{code}
It looks like {{testSnapshotAsyncFailureFailsCheckpoint}} caused it: the test
itself finished successfully, but a fatal error occurred while the cluster was
shutting down:
{code}
17:07:27,264 [ Checkpoint Timer] ERROR
org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread
'Checkpoint Timer' produced an uncaught exception. Stopping the process...
java.util.concurrent.CompletionException:
java.util.concurrent.CompletionException: java.lang.IllegalStateException:
CheckpointsCleaner has already been closed
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:626)
~[classes/:?]
at
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
~[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
~[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:814)
[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
[?:1.8.0_292]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_292]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[?:1.8.0_292]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_292]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[?:1.8.0_292]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_292]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_292]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.util.concurrent.CompletionException:
java.lang.IllegalStateException: CheckpointsCleaner has already been closed
at
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
~[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
~[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:838)
~[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_292]
... 8 more
Caused by: java.lang.IllegalStateException: CheckpointsCleaner has already been
closed
at
org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
~[flink-core-1.14-stream-SNAPSHOT.jar:1.14-stream-SNAPSHOT]
at
org.apache.flink.runtime.checkpoint.CheckpointsCleaner.incrementNumberOfCheckpointsToClean(CheckpointsCleaner.java:105)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanup(CheckpointsCleaner.java:87)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanCheckpoint(CheckpointsCleaner.java:62)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.PendingCheckpoint.dispose(PendingCheckpoint.java:573)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:551)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1939)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1926)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:910)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:875)
~[classes/:?]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$6(CheckpointCoordinator.java:614)
~[classes/:?]
at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_292]
at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_292]
... 8 more
{code}
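The stack trace suggests an ordering problem: the trigger-failure handler on the
{{Checkpoint Timer}} thread asks the cleaner to dispose of the pending
checkpoint after the cleaner has already been closed. The following is a minimal
sketch of that pattern, not Flink code. The class {{ClosedCleanerRace}}, the
nested {{Cleaner}} and the method {{runScenario}} are hypothetical names used
only for illustration, and the real {{CheckpointsCleaner}} and
{{CheckpointCoordinator}} are considerably more involved.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

class ClosedCleanerRace {

    /** Hypothetical stand-in for a cleaner that rejects work once it is closed. */
    static final class Cleaner {
        private boolean closed;

        synchronized void close() {
            closed = true;
        }

        synchronized void clean() {
            // Same kind of guard as a checkState(...) precondition.
            if (closed) {
                throw new IllegalStateException("Cleaner has already been closed");
            }
        }
    }

    /**
     * Fails a pending trigger future and returns the exception raised by its
     * failure handler, or null if the handler ran without error.
     */
    static Throwable runScenario(boolean closeBeforeFailure) {
        Cleaner cleaner = new Cleaner();
        CompletableFuture<Void> trigger = new CompletableFuture<>();

        // Failure handler: aborting the pending work asks the cleaner to dispose it.
        CompletableFuture<Void> handled = trigger.handle((ok, failure) -> {
            if (failure != null) {
                cleaner.clean();
            }
            return null;
        });

        if (closeBeforeFailure) {
            // Models shutdown closing the cleaner while a trigger is still pending.
            cleaner.close();
        }
        trigger.completeExceptionally(new RuntimeException("async snapshot failed"));

        try {
            handled.join();
            return null;
        } catch (CompletionException e) {
            return e.getCause();
        }
    }

    public static void main(String[] args) {
        System.out.println("close after failure:  " + runScenario(false));
        System.out.println("close before failure: " + runScenario(true));
    }
}
```

In this sketch, {{runScenario(false)}} returns {{null}} because the handler runs
while the cleaner is still open, whereas {{runScenario(true)}} returns the
{{IllegalStateException}} thrown by the guard. In the reported failure the
corresponding exception reached {{FatalExitExceptionHandler}}, which stopped
the forked JVM.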