[
https://issues.apache.org/jira/browse/FLINK-39879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086729#comment-18086729
]
Martijn Visser commented on FLINK-39879:
----------------------------------------
Root cause:
testCheckpointAckFailure waited on an unbounded stateUpdatedFuture.get(). The
test deliberately sets a tiny pekko.ask.timeout (250 ms) so that the oversized
checkpoint-ACK RPC times out; the AskTimeoutException assertion depends on that
timeout. But the 250 ms limit applies to every cluster RPC, so on a slow CI
agent an unrelated RPC (e.g. task deployment) can time out and fail the job
terminally before the keyed state is updated. The state future then never
completes, the wait hangs, and the 900 s no-output watchdog SIGTERMs the whole
surefire fork.
Fix (test-side):
A terminally-failed job completes MiniClusterJobClient#getJobExecutionResult()
exceptionally (JobResult#toJobExecutionResult throws JobExecutionException,
wrapped by thenApply in a CompletionException). A whenComplete handler unwraps
that to the real cause and completes stateUpdatedFuture exceptionally, so the
wait fails fast instead of hanging. @Timeout(5, MINUTES) is the hard anti-hang
guard (both .get() calls stay intentionally unbounded; JUnit's @Timeout
interrupts the parked, interruptible managedBlock). Verified: happy path passes
in ~1.8 s; forced fail-fast path surfaces the clean JobExecutionException in
~24 s with no hang.
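For readers who want the shape of the fail-fast wiring without opening the
patch, here is a minimal JDK-only sketch. It is not the code from the fix:
FailFastSketch, unwrap and failFastOn are hypothetical names, a plain
CompletableFuture stands in for the future returned by
getJobExecutionResult(), and IllegalStateException stands in for
JobExecutionException.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

// Hypothetical illustration only; none of these names come from Flink.
class FailFastSketch {

    /** Strips CompletionException/ExecutionException wrappers to reach the real cause. */
    static Throwable unwrap(Throwable t) {
        while ((t instanceof CompletionException || t instanceof ExecutionException)
                && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /**
     * Propagates a failure of jobResult into stateUpdated, so a thread blocked
     * on stateUpdated.get() wakes up instead of waiting forever.
     */
    static <T> void failFastOn(CompletableFuture<T> jobResult, CompletableFuture<?> stateUpdated) {
        jobResult.whenComplete((result, error) -> {
            if (error != null) {
                // No effect if stateUpdated has already completed.
                stateUpdated.completeExceptionally(unwrap(error));
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the raw job result delivered by the cluster.
        CompletableFuture<String> rawResult = new CompletableFuture<>();

        // Stand-in for the conversion step: the mapping function throws, so
        // thenApply completes the dependent future with a CompletionException.
        CompletableFuture<String> jobResult = rawResult.thenApply(r -> {
            throw new IllegalStateException("job failed terminally");
        });

        // Stand-in for the future the test waits on.
        CompletableFuture<Void> stateUpdated = new CompletableFuture<>();
        failFastOn(jobResult, stateUpdated);

        // The job terminates before the state is ever updated.
        rawResult.complete("FAILED");

        try {
            stateUpdated.get(); // unbounded, but now returns promptly
        } catch (ExecutionException e) {
            System.out.println("failed fast with: " + e.getCause());
        }
    }
}
```

Without the failFastOn call, the stateUpdated.get() in main would park
indefinitely, which is the hang described above. The handler only acts on
failure, so a job that succeeds leaves the state future alone and the happy
path is unchanged.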
> CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure hangs on
> CompletableFuture and never completes
> ---------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39879
> URL: https://issues.apache.org/jira/browse/FLINK-39879
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Reporter: Martijn Visser
> Priority: Major
>
> CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure blocks on
> CompletableFuture.join() with no per-test timeout, so when the future never
> completes, the whole fork hangs:
> {code:java}
> Jun 06 05:53:56 "ForkJoinPool-1-worker-1" #18 daemon prio=5 os_prio=0
> cpu=2299.88ms elapsed=3706.28s tid=0x00007fb86da86800 nid=0x59e07 waiting on
> condition [0x00007fb86b3f6000]
> Jun 06 05:53:56 java.lang.Thread.State: WAITING (parking)
> Jun 06 05:53:56 at
> jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> Jun 06 05:53:56 - parking to wait for <0x00000000a74001d8> (a
> java.util.concurrent.CompletableFuture$Signaller)
> Jun 06 05:53:56 at
> java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
> Jun 06 05:53:56 at
> java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1864)
> Jun 06 05:53:56 at
> java.util.concurrent.ForkJoinPool.compensatedBlock([email protected]/ForkJoinPool.java:3449)
> Jun 06 05:53:56 at
> java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3432)
> Jun 06 05:53:56 at
> java.util.concurrent.CompletableFuture.waitingGet([email protected]/CompletableFuture.java:1898)
> Jun 06 05:53:56 at
> java.util.concurrent.CompletableFuture.join([email protected]/CompletableFuture.java:2117)
> Jun 06 05:53:56 at
> org.apache.flink.test.checkpointing.CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure(CheckpointAcknowledgeFailureITCase.java:111)
> Jun 06 05:53:56 at
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native
> Method)
> Jun 06 05:53:56 at
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
> Jun 06 05:53:56 at
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
> Jun 06 05:53:56 at
> java.lang.reflect.Method.invoke([email protected]/Method.java:568)
> Jun 06 05:53:56 at
> org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:767)
> Jun 06 05:53:56 at
> org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75716&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=115e5c38-6efb-5006-4921-5e2851da71ef&l=8839