Since rolling out 5.5 (which includes my change to `validateAfterInactivity`, previously interpreted as a connection TTL for async HTTP/1.1 connections), I've been getting reports of intermittent request failures that appear to be caused by stale connection reuse. The async client is actually much more vulnerable to this issue than we realized, and I'm trying to understand why.
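For context, the setting is applied roughly like this on our side (a minimal sketch; the builder wiring and the interval are illustrative, not our production configuration):

```java
import org.apache.hc.client5.http.config.ConnectionConfig;
import org.apache.hc.client5.http.impl.async.CloseableHttpAsyncClient;
import org.apache.hc.client5.http.impl.async.HttpAsyncClients;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager;
import org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManagerBuilder;
import org.apache.hc.core5.util.TimeValue;

public final class ClientFactory {
    public static CloseableHttpAsyncClient create() {
        // Connections idle longer than this are supposed to be validated
        // before reuse; the async client previously misinterpreted the
        // setting as a TTL for HTTP/1.1 connections.
        final ConnectionConfig connectionConfig = ConnectionConfig.custom()
                .setValidateAfterInactivity(TimeValue.ofSeconds(2))
                .build();
        final PoolingAsyncClientConnectionManager cm =
                PoolingAsyncClientConnectionManagerBuilder.create()
                        .setDefaultConnectionConfig(connectionConfig)
                        .build();
        final CloseableHttpAsyncClient client = HttpAsyncClients.custom()
                .setConnectionManager(cm)
                .build();
        client.start();
        return client;
    }
}
```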
First, let me state my understanding of how server-initiated connection closure works. A FIN from the server wakes up the selector, which fires an input event that is handled by the code in AbstractHttp1StreamDuplexer, which "reads" the end-of-stream. This then initiates graceful shutdown of the connection. I believe this works by the client closing its end of the connection, which is an output event (the client has to send its own FIN, as well as close_notify if TLS is in use). Once this is handled, the duplexer transitions the connection state to SHUTDOWN and closes the InternalDataChannel, which (for a graceful shutdown) adds itself to the closedSessions queue. The IOReactor processes this queue in processClosedSessions(), which fails any requests on the connection and removes it from the pool.

The basic issue seems to be that since connection reuse does not go through the IOReactor, there is a race condition between the IOReactor's event loop (which drives all the processing and bookkeeping for remote connection closure) and request execution (which can draw a doomed connection from the connection pool).

I wrote a test that sends requests to a server which sends a response and then immediately closes the connection (without any `Connection: close` header); a minimal sketch of such a server is shown below. Part of the idea here is to simulate an async client running in AWS Lambda, where the JVM is suspended between invocations: if invocations are intermittent, connections will be remotely closed while the JVM is suspended, and then during the next invocation the IOReactor will race with the client request in just this way.
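For reference, here is a minimal sketch of the kind of misbehaving server the test runs against (this is not my actual test harness; the class name and port are arbitrary). It writes a complete HTTP/1.1 response that permits reuse and then immediately closes the socket:

```java
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public final class AbruptCloseServer {
    public static void main(final String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                try (Socket socket = server.accept()) {
                    // A single read is enough for a small GET request in this sketch.
                    socket.getInputStream().read(new byte[8192]);
                    final OutputStream out = socket.getOutputStream();
                    // No Connection: close header, so the client considers
                    // the connection reusable.
                    out.write(("HTTP/1.1 200 OK\r\n"
                            + "Content-Length: 2\r\n"
                            + "\r\n"
                            + "OK").getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                } // try-with-resources closes the socket: FIN right after the response
            }
        }
    }
}
```

The key detail is that nothing in the response tells the client to discard the connection, so the pool will hand it out again while the server's FIN may still be waiting in the reactor's event queue.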
What I found with this test setup is that sending requests one at a time mostly works, but sometimes results in an exception (typically RequestNotExecutedException). Setting a breakpoint on PoolEntry::discardConnection, I can see that the client usually detects the stale connection (the closure has already been processed by the duplexer):

```
at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.isConnected(PoolingAsyncClientConnectionManager.java:739)
at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.isEndpointConnected(InternalHttpAsyncExecRuntime.java:208)
at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:157)
at org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:153)
at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:128)
at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:120)
at org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:148)
at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$3$1.leaseCompleted(PoolingAsyncClientConnectionManager.java:336)
```

But sometimes the connection closure is not processed in time, in which case I see this stack trace, where the IOReactor fails the request and discards the connection pool entry:

```
at org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170)
at org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.close(PoolingAsyncClientConnectionManager.java:724)
at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:148)
at org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:180)
at org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:363)
at org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:261)
at org.apache.hc.client5.http.impl.async.ContentCompressionAsyncExec$1.failed(ContentCompressionAsyncExec.java:160)
at org.apache.hc.client5.http.impl.async.AsyncHttpRequestRetryExec$1.failed(AsyncHttpRequestRetryExec.java:203)
at org.apache.hc.client5.http.impl.async.AsyncProtocolExec$1.failed(AsyncProtocolExec.java:297)
at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.failed(HttpAsyncMainClientExec.java:135)
at org.apache.hc.core5.http.nio.command.RequestExecutionCommand.failed(RequestExecutionCommand.java:101)
at org.apache.hc.core5.http.nio.command.CommandSupport.cancelCommands(CommandSupport.java:68)
at org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onDisconnect(AbstractHttp1StreamDuplexer.java:415)
at org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.disconnected(AbstractHttp1IOEventHandler.java:95)
at org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.disconnected(ClientHttp1IOEventHandler.java:41)
at org.apache.hc.core5.reactor.ssl.SSLIOSession$1.disconnected(SSLIOSession.java:253)
at org.apache.hc.core5.reactor.InternalDataChannel.disconnected(InternalDataChannel.java:205)
at org.apache.hc.core5.reactor.SingleCoreIOReactor.processClosedSessions(SingleCoreIOReactor.java:229)
```

Sending multiple requests in parallel (with a connection pool size of 1) greatly exacerbates the race condition: _half_ of all requests fail. The exceptions vary: RequestNotExecutedException, ConnectionClosedException ("Connection closed by peer"), and SocketException (connection reset) are all possible, but what reliably happens is that every other request fails due to stale (closed) connection reuse. I've reproduced this with plain HTTP, TLSv1.2, and TLSv1.3.

What I'd like to know is:

1. Can we do anything to narrow this race condition? Is there currently any blocking work, or are there any async commands that need to run, before the connection pool entry can be discarded or the endpoint marked as disconnected?
2. Can we do anything to detect stale connection reuse more reliably when it does happen? Are there cases where we could surface this as a RequestNotExecutedException where we currently aren't?
3. Can the async client implement `validateAfterInactivity` for HTTP/1.1 requests? I think we could implement this with a synchronization barrier rather than with I/O operations: we validate an inactive connection by waiting for the next iteration of the IOReactor event loop. This would allow any backlog of I/O events to be processed, preventing the client from racing with itself. A rough sketch of what I have in mind follows this list.
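To make question 3 concrete, here is a rough sketch of the barrier idea. All of these types are hypothetical stand-ins (ReactorLoopBarrier, PooledConnection, and InactiveConnectionValidator do not exist in HttpCore); the point is only to illustrate the control flow, not to propose an API:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical: the reactor completes the returned future from its own
// thread at the start of the next event-loop iteration. Nothing like
// this exists in HttpCore today.
interface ReactorLoopBarrier {
    CompletableFuture<Void> awaitNextLoopIteration();
}

// Hypothetical stand-in for the pool entry / connection endpoint.
interface PooledConnection {
    boolean isOpen();
}

// Validation by synchronization rather than by I/O: wait for the reactor
// to work through its backlog (including any pending disconnect for this
// connection), then simply check whether the connection is still open.
final class InactiveConnectionValidator {
    private final ReactorLoopBarrier barrier;

    InactiveConnectionValidator(final ReactorLoopBarrier barrier) {
        this.barrier = barrier;
    }

    CompletableFuture<Boolean> validate(final PooledConnection conn) {
        return barrier.awaitNextLoopIteration()
                .thenApply(ignored -> conn.isOpen());
    }
}
```

If the peer's FIN arrived while the connection sat idle in the pool, the reactor will have driven the duplexer through its shutdown sequence by the time the future completes, so isOpen() returns false and the entry can be discarded before a request is ever bound to it.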