[ 
https://issues.apache.org/jira/browse/FLINK-39879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086729#comment-18086729
 ] 

Martijn Visser commented on FLINK-39879:
----------------------------------------

Root cause:

testCheckpointAckFailure waited on an unbounded stateUpdatedFuture.get(). The 
test deliberately sets a tiny pekko.ask.timeout (250 ms) so that the oversized 
checkpoint-ACK RPC times out; the AskTimeoutException assertion depends on that 
timeout. But the 250 ms limit applies to every cluster RPC, so on a slow CI 
agent an unrelated RPC (e.g. task deployment) can time out and fail the job 
terminally before the keyed state is updated. The state future then never 
completes, the wait hangs, and the 900 s no-output watchdog SIGTERMs the whole 
surefire fork.
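
The failure mode can be reproduced without Flink. The following is a minimal 
sketch using plain java.util.concurrent, not the actual test code: the class 
name, the helper completesWithin, and the two stand-in futures are hypothetical 
and only mirror the roles described above. Because nothing links the failed job 
future to the state future, a wait on the latter never returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class UnboundedWaitSketch {

    /** Returns true if the future completes (normally or exceptionally) within the timeout. */
    public static boolean completesWithin(CompletableFuture<?> future, long timeoutMillis) {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // An unbounded get() would still be parked at this point.
            return false;
        } catch (ExecutionException e) {
            // Completed exceptionally, but it did complete.
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for stateUpdatedFuture: only the success path completes it.
        CompletableFuture<Void> stateUpdatedFuture = new CompletableFuture<>();

        // Hypothetical stand-in for the job result: the job fails before the state update.
        CompletableFuture<String> jobResultFuture = new CompletableFuture<>();
        jobResultFuture.completeExceptionally(
                new IllegalStateException("unrelated RPC timed out"));

        // The job failure is visible on its own future, but the state future is untouched.
        System.out.println("job failed: " + jobResultFuture.isCompletedExceptionally());
        System.out.println(
                "state future completes: " + completesWithin(stateUpdatedFuture, 200));
    }
}
```

The bounded get() is used here only to make the hang observable in a demo; it 
is not what the fix does.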

Fix (test-side):

A terminally failed job completes MiniClusterJobClient#getJobExecutionResult() 
exceptionally (JobResult#toJobExecutionResult throws JobExecutionException, 
which thenApply wraps in a CompletionException). The fix adds a whenComplete 
handler that unwraps this to the real cause and completes stateUpdatedFuture 
exceptionally, so the wait fails fast instead of hanging. @Timeout(5, MINUTES) 
is the hard anti-hang guard: both .get() calls stay intentionally unbounded, 
and JUnit's @Timeout interrupts the parked, interruptible managedBlock. 
Verified: the happy path passes in ~1.8 s; the forced fail-fast path surfaces 
the clean JobExecutionException in ~24 s with no hang.
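
The fail-fast wiring can be sketched as follows. This is an illustration under 
stated assumptions, not the committed patch: it uses only java.util.concurrent, 
and the class name, the helpers unwrap, failFastOnJobFailure and 
toExecutionResult, and the use of IllegalStateException in place of 
JobExecutionException are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

public class FailFastSketch {

    /** Strips CompletionException / ExecutionException wrappers to reach the real cause. */
    public static Throwable unwrap(Throwable t) {
        Throwable current = t;
        while ((current instanceof CompletionException || current instanceof ExecutionException)
                && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** If the job result fails, fail the state future as well so that waiters return. */
    public static <T> void failFastOnJobFailure(
            CompletableFuture<T> jobResultFuture, CompletableFuture<?> stateUpdatedFuture) {
        jobResultFuture.whenComplete(
                (result, failure) -> {
                    if (failure != null) {
                        stateUpdatedFuture.completeExceptionally(unwrap(failure));
                    }
                });
    }

    /** Hypothetical stand-in for a result conversion that throws for a failed job. */
    public static String toExecutionResult(String jobStatus) {
        if ("FAILED".equals(jobStatus)) {
            throw new IllegalStateException("job failed terminally");
        }
        return jobStatus;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> rawJobResult = new CompletableFuture<>();
        // An exception thrown inside thenApply surfaces wrapped in a CompletionException.
        CompletableFuture<String> jobExecutionResult =
                rawJobResult.thenApply(FailFastSketch::toExecutionResult);

        CompletableFuture<Void> stateUpdatedFuture = new CompletableFuture<>();
        failFastOnJobFailure(jobExecutionResult, stateUpdatedFuture);

        rawJobResult.complete("FAILED");

        try {
            // Unbounded wait, as in the test; it now returns because the future was failed.
            stateUpdatedFuture.get();
        } catch (ExecutionException e) {
            System.out.println("failed fast with: " + e.getCause());
        }
    }
}
```

On the success path the handler does nothing, so the state future is still 
completed only by the state update itself.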

> CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure hangs on 
> CompletableFuture and never completes
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39879
>                 URL: https://issues.apache.org/jira/browse/FLINK-39879
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Martijn Visser
>            Priority: Major
>
> CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure blocks on 
> CompletableFuture.join() with no per-test timeout, so when the future never 
> completes, the whole fork hangs:
> {code:java}
> Jun 06 05:53:56 "ForkJoinPool-1-worker-1" #18 daemon prio=5 os_prio=0 cpu=2299.88ms elapsed=3706.28s tid=0x00007fb86da86800 nid=0x59e07 waiting on condition  [0x00007fb86b3f6000]
> Jun 06 05:53:56    java.lang.Thread.State: WAITING (parking)
> Jun 06 05:53:56       at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> Jun 06 05:53:56       - parking to wait for  <0x00000000a74001d8> (a java.util.concurrent.CompletableFuture$Signaller)
> Jun 06 05:53:56       at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
> Jun 06 05:53:56       at java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1864)
> Jun 06 05:53:56       at java.util.concurrent.ForkJoinPool.compensatedBlock([email protected]/ForkJoinPool.java:3449)
> Jun 06 05:53:56       at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3432)
> Jun 06 05:53:56       at java.util.concurrent.CompletableFuture.waitingGet([email protected]/CompletableFuture.java:1898)
> Jun 06 05:53:56       at java.util.concurrent.CompletableFuture.join([email protected]/CompletableFuture.java:2117)
> Jun 06 05:53:56       at org.apache.flink.test.checkpointing.CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure(CheckpointAcknowledgeFailureITCase.java:111)
> Jun 06 05:53:56       at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0([email protected]/Native Method)
> Jun 06 05:53:56       at jdk.internal.reflect.NativeMethodAccessorImpl.invoke([email protected]/NativeMethodAccessorImpl.java:77)
> Jun 06 05:53:56       at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
> Jun 06 05:53:56       at java.lang.reflect.Method.invoke([email protected]/Method.java:568)
> Jun 06 05:53:56       at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:767)
> Jun 06 05:53:56       at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75716&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=115e5c38-6efb-5006-4921-5e2851da71ef&l=8839



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
