[ 
https://issues.apache.org/jira/browse/FLINK-39858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39858:
-----------------------------------
    Labels: pull-request-available test-stability  (was: test-stability)

> RestClient.close() can leave in-flight request futures uncompleted, hanging 
> the caller
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-39858
>                 URL: https://issues.apache.org/jira/browse/FLINK-39858
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST, Tests
>            Reporter: Martijn Visser
>            Assignee: Martijn Visser
>            Priority: Major
>              Labels: pull-request-available, test-stability
>
> {{RestClientTest.testRestClientClosedHandling}} hung intermittently in the 
> {{test_cron_hadoop313}} leg on master, where the surefire JVM produced no 
> output for 900s and was watchdog-killed.
> Unlike a deterministic failure, it only reproduces under load: the preceding 
> {{ForwardEdgesAdapterTest}} (100k invocations, ~531s) saturated the agent and 
> widened the race window.
> The thread dump taken at the watchdog kill shows the test worker parked 
> forever on the request future:
> {code:java}
> "ForkJoinPool-295-worker-1"
>    java.lang.Thread.State: WAITING
>         at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
>         at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.assertEventuallyFails(FlinkCompletableFutureAssert.java:161)
>         at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.eventuallyFailsWith(FlinkCompletableFutureAssert.java:135)
>         at org.apache.flink.runtime.rest.RestClientTest.testRestClientClosedHandling(RestClientTest.java:257)
> {code}
> Root cause: {{RestClient}} tracks in-flight requests only via 
> {{responseChannelFutures}}, which holds each request's connect-phase 
> {{CompletableFuture}}. The connect listener removes that future the moment 
> the TCP connection is established, before the request enters its in-flight 
> (response) phase, so from then on the request is tracked by nothing. On 
> {{close()}}, {{notifyResponseFuturesOfShutdown()}} only fails the futures 
> still in {{responseChannelFutures}}. When {{close()}} races with a request 
> that has just passed the connect phase, the terminal response future is never 
> completed (the channel's {{channelInactive}} callback may not be dispatched 
> once the event-loop group is being torn down), so a caller blocking on it 
> hangs indefinitely.
> FLINK-39180 previously treated the same test failure as a benign 
> assertion-type mismatch and assumed the future is always completed on close; 
> that holds only for the connect phase, not the in-flight phase, so the 
> underlying defect remained.
> Solution: track the terminal per-request response future for its whole 
> lifetime in a dedicated set, fail those futures on close, and re-check 
> {{isRunning}} after registration to close the check-then-act race. The 
> re-check fails the future only if it is still registered, tested atomically.
> Failed CI build (Azure DevOps {{flink-ci.flink-master-mirror}}, 20260604.1): 
> https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75618
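
For readers who want to see the shape of the approach described under "Solution" above, here is a minimal sketch. It is not the code from the pull request: the class TrackedRequestClient, its methods, and the exception type are hypothetical stand-ins, and the real RestClient involves Netty channels that are left out here. The sketch only shows the three parts named in the description: a set that holds each terminal response future for its whole lifetime, failing those futures on close, and re-checking isRunning after registration.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a client with in-flight requests; not Flink's RestClient. */
class TrackedRequestClient implements AutoCloseable {

    private final AtomicBoolean isRunning = new AtomicBoolean(true);

    // Holds every terminal response future from submission until completion.
    private final Set<CompletableFuture<String>> inFlight = ConcurrentHashMap.newKeySet();

    /** Registers a new request and returns its terminal response future. */
    CompletableFuture<String> submit() {
        CompletableFuture<String> response = new CompletableFuture<>();
        inFlight.add(response);

        // Deregister once the future completes, whatever the outcome.
        response.whenComplete((result, error) -> inFlight.remove(response));

        // Re-check after registration: close() may have finished draining the set
        // just before add(). Set.remove() returns true for exactly one caller, so
        // the future is failed here only if close() has not already claimed it.
        if (!isRunning.get() && inFlight.remove(response)) {
            response.completeExceptionally(new IllegalStateException("client closed"));
        }
        return response;
    }

    /** Fails every request that is still registered. */
    @Override
    public void close() {
        if (isRunning.compareAndSet(true, false)) {
            for (CompletableFuture<String> response : inFlight) {
                if (inFlight.remove(response)) {
                    response.completeExceptionally(new IllegalStateException("client closed"));
                }
            }
        }
    }

    /** Number of futures currently tracked (for inspection only). */
    int inFlightCount() {
        return inFlight.size();
    }
}
```

In this sketch the running flag is cleared before the set is drained, so a request registered after the drain always observes the cleared flag in its re-check, and a request registered before the drain is failed by close(). Either way the caller's future completes instead of staying pending.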



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
