MartijnVisser opened a new pull request, #28315:
URL: https://github.com/apache/flink/pull/28315

   ## What is the purpose of the change
   
   `RestClient.close()` could leave a request that had already passed the 
connect phase with its terminal response future uncompleted, hanging the caller 
indefinitely. In-flight requests are tracked only via `responseChannelFutures`, 
which holds the connect-phase `CompletableFuture<Channel>`; the connect 
listener removes that entry the moment the TCP connection is established, 
before the request enters its in-flight (response) phase. A `close()` racing 
with such a request runs `notifyResponseFuturesOfShutdown()` over only 
`responseChannelFutures`, so the terminal future is never failed (the channel's 
`channelInactive` callback may not be dispatched once the event-loop group is 
torn down). This surfaced as a watchdog-killed surefire JVM on master for 
`RestClientTest.testRestClientClosedHandling` (see FLINK-39858). FLINK-39180 
previously treated the same symptom as a benign assertion-type mismatch, which 
held only for the connect phase.
   
   ## Brief change log
   
     - Track each request's terminal response future for its whole lifetime in 
a new `pendingRequestFutures` set, deregistered on natural completion.
     - `notifyResponseFuturesOfShutdown()` now also fails the entries in 
`pendingRequestFutures`, so `close()` fails in-flight requests in any phase.
     - Re-check `isRunning` after registration (failing only a future still 
atomically registered) to close the check-then-act race with a concurrent 
`close()`.
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added 
`RestClientTest.testCloseFailsInFlightRequestFutureAfterConnectPhase`, which 
drives a request past the connect phase against a local server that accepts but 
never replies, then asserts `close()` fails the in-flight future (behavioral 
guard) and the in-flight tracking invariant, both with bounded waits so a 
regression fails rather than hangs. Verified it fails without the production 
change and passes with it.
     - Covered by existing `RestClientTest` (`testRestClientClosedHandling`, 
`testCloseClientWhileProcessingRequest`, 
`testResponseChannelFuturesResolvedExceptionallyOnClose`); full 
`RestClientTest` is green.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: `RestClient` shutdown 
now reliably fails in-flight requests instead of potentially leaking them; no 
recovery semantics change
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes (Claude Code (Claude Opus 4.8))
   
   Generated-by: Claude Code (Claude Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to