Martijn Visser created FLINK-39858:
--------------------------------------
Summary: RestClient.close() can leave in-flight request futures
uncompleted, hanging the caller
Key: FLINK-39858
URL: https://issues.apache.org/jira/browse/FLINK-39858
Project: Flink
Issue Type: Bug
Components: Runtime / REST, Tests
Reporter: Martijn Visser
Assignee: Martijn Visser
{{RestClientTest.testRestClientClosedHandling}} hung intermittently in the
{{test_cron_hadoop313}} leg on master, where the surefire JVM produced no
output for 900s and was watchdog-killed.
Unlike a deterministic failure, it reproduces only under load: the preceding
{{ForwardEdgesAdapterTest}} (100k invocations, ~531s) saturated the agent and
widened the race window.
The thread dump taken at the watchdog kill shows the test worker parked forever
on the request future:
{code:java}
"ForkJoinPool-295-worker-1"
java.lang.Thread.State: WAITING
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.assertEventuallyFails(FlinkCompletableFutureAssert.java:161)
at org.apache.flink.core.testutils.FlinkCompletableFutureAssert.eventuallyFailsWith(FlinkCompletableFutureAssert.java:135)
at org.apache.flink.runtime.rest.RestClientTest.testRestClientClosedHandling(RestClientTest.java:257)
{code}
Root cause: {{RestClient}} tracks in-flight requests only via
{{responseChannelFutures}}, which holds each request's connect-phase
{{CompletableFuture}}. The connect listener removes that future the moment the
TCP connection is established, before the request enters its in-flight
(response) phase, so from then on the request is tracked by nothing. On
{{close()}}, {{notifyResponseFuturesOfShutdown()}} only fails the futures still
in {{responseChannelFutures}}. When {{close()}} races with a request that has
just passed the connect phase, the terminal response future is never completed
(the channel's {{channelInactive}} callback may not be dispatched once the
event-loop group is being torn down), so a caller blocking on it hangs
indefinitely.
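The interleaving described above can be modelled in a few lines. This is a simplified sketch, not Flink's actual {{RestClient}} code: the class, the {{Request}} holder, {{wireResponse}} and {{onConnected}} are hypothetical names, and the set only plays the role of {{responseChannelFutures}}. It assumes the response future is chained off the connect future and that nothing completes it once the channel callbacks stop being dispatched.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of connect-phase-only tracking; not Flink's actual RestClient code. */
final class ConnectOnlyTrackingSketch {

    /** Hypothetical per-request state. */
    static final class Request {
        // Connect-phase future: the only thing the client tracks.
        final CompletableFuture<String> connect = new CompletableFuture<>();
        // Would be completed by the channel handler (response or channelInactive).
        final CompletableFuture<String> wireResponse = new CompletableFuture<>();
        // Terminal future handed to the caller.
        final CompletableFuture<String> response = connect.thenCompose(channel -> wireResponse);
    }

    // Plays the role of responseChannelFutures.
    private final Set<CompletableFuture<String>> connectFutures = ConcurrentHashMap.newKeySet();

    Request submit() {
        Request request = new Request();
        connectFutures.add(request.connect);
        return request;
    }

    /** Connect listener: untracks the request as soon as the connection is established. */
    void onConnected(Request request) {
        connectFutures.remove(request.connect);
        request.connect.complete("channel");
    }

    /** Fails only what is still tracked, i.e. requests that are still connecting. */
    void close() {
        for (CompletableFuture<String> connect : connectFutures) {
            if (connectFutures.remove(connect)) {
                connect.completeExceptionally(new IllegalStateException("client closed"));
            }
        }
    }

    public static void main(String[] args) {
        ConnectOnlyTrackingSketch client = new ConnectOnlyTrackingSketch();
        Request connecting = client.submit();
        Request inFlight = client.submit();
        client.onConnected(inFlight); // passes the connect phase just before close()
        client.close();

        System.out.println(connecting.response.isCompletedExceptionally()); // true
        System.out.println(inFlight.response.isDone()); // false: an unbounded get() would park
    }
}
```

In this model the request that was still connecting is failed by {{close()}}, while the one that had just connected is left with a response future that nothing will ever complete.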
FLINK-39180 previously treated the same test failure as a benign assertion-type
mismatch and assumed the future is always completed on close; that holds only
for the connect phase, not the in-flight phase, so the underlying defect
remained.
Solution: track the terminal per-request response future for its whole lifetime
in a dedicated set, fail those futures on close, and re-check {{isRunning}}
after registration to close the check-then-act race. The re-check fails a
future only if it is still registered, with that check done atomically.
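One possible shape of that approach is sketched below. It is an illustration under assumptions, not the actual patch: the class name, {{inFlightResponses}}, {{inFlightCount()}} and {{closedException()}} are hypothetical, and only {{isRunning}} is taken from the description above. The sketch treats a successful {{Set.remove}} on a concurrent set as the atomic "still registered" check, so each future is failed by at most one of {{submit()}} and {{close()}}.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch of lifetime tracking for response futures; names are hypothetical. */
final class TrackedResponseFuturesSketch {

    private final AtomicBoolean isRunning = new AtomicBoolean(true);
    // Holds every terminal response future until it completes, not just while connecting.
    private final Set<CompletableFuture<String>> inFlightResponses = ConcurrentHashMap.newKeySet();

    CompletableFuture<String> submit() {
        CompletableFuture<String> response = new CompletableFuture<>();
        if (!isRunning.get()) {
            response.completeExceptionally(closedException());
            return response;
        }
        inFlightResponses.add(response);
        // Deregister on any completion, normal or exceptional.
        response.whenComplete((value, error) -> inFlightResponses.remove(response));
        // Re-check after registration: close() may have run between the first check and
        // the add. remove() returns true for exactly one caller, so only one side fails it.
        if (!isRunning.get() && inFlightResponses.remove(response)) {
            response.completeExceptionally(closedException());
        }
        return response;
    }

    void close() {
        if (!isRunning.compareAndSet(true, false)) {
            return; // already closed
        }
        // Iteration over a concurrent key set tolerates concurrent removal.
        for (CompletableFuture<String> response : inFlightResponses) {
            if (inFlightResponses.remove(response)) {
                response.completeExceptionally(closedException());
            }
        }
    }

    int inFlightCount() {
        return inFlightResponses.size();
    }

    private static IllegalStateException closedException() {
        return new IllegalStateException("client closed");
    }

    public static void main(String[] args) {
        TrackedResponseFuturesSketch client = new TrackedResponseFuturesSketch();
        CompletableFuture<String> pending = client.submit();
        client.close();
        System.out.println(pending.isCompletedExceptionally()); // true
        System.out.println(client.submit().isCompletedExceptionally()); // true
    }
}
```

Because {{isRunning}} is flipped before the set is drained, a request registered after the drain still observes the closed state in its re-check, so no interleaving in this sketch leaves a registered future uncompleted.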
Failed CI build (Azure DevOps {{flink-ci.flink-master-mirror}}, 20260604.1):
https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=75618
--
This message was sent by Atlassian Jira
(v8.20.10#820010)