Incidentally, I just spotted this commit: https://github.com/apache/httpcomponents-client/commit/99d4a5e081f31616a2558e72824ed8cf52198596
I didn't even know that Java had these socket options. This will be a big help, especially for Lambda: since TCP keep-alive is implemented by the kernel, it will continue to operate even while the JVM is suspended. This is why the public documentation advises the use of keepalive: https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html On Tue, Aug 5, 2025 at 7:29 PM Ryan Schmitt <rschm...@apache.org> wrote: > Since rolling out 5.5 (which includes my change for > `validateAfterInactivity`, which was previously interpreted as a connection > TTL for async HTTP 1.1 connections), I've been getting reports of > intermittent request failures which appear to be caused by stale connection > reuse. The async client is actually much more vulnerable to this issue than > we realized, and I'm trying to understand why. > > First, let me state my understanding of how server-initiated connection > closure works. A FIN from the server wakes up the selector, which fires an > input event that is handled by the code in AbstractHttp1StreamDuplexer, > which "reads" the end-of-stream. This then initiates graceful shutdown of > the connection. I believe this works by the client closing its end of the > connection, which is an output event (the client has to send FIN, as well > as close_notify if TLS). Once this is handled, the duplexer transitions the > connection state to SHUTDOWN and closes the InternalDataChannel, which (for > a graceful shutdown) adds itself to the closedSessions queue. The IOReactor > processes this queue in processClosedSessions(), which will fail any > requests on the connection and remove it from the pool. > > The basic issue seems to be that since connection reuse does not go > through the IOReactor, there's a race condition between the IOReactor's > event loop (which drives all the processing and bookkeeping for remote > connection closure) and request execution (which can draw a doomed > connection from the connection pool). I wrote a test that sends requests to > a server, which sends a response and then immediately closes the connection > (without any `Connection: close` header). Part of the idea here is to > simulate an async client running in AWS Lambda, where the JVM is suspended > between invocations: if invocations are intermittent, connections will be > remotely closed while the JVM is suspended, and then during the next > invocation the IOReactor will race with the client request in just this way. > > What I found with this test setup is that sending requests one at a time > mostly works, but sometimes results in an exception (typically > RequestNotExecutedException). Setting a breakpoint on > PoolEntry::discardConnection, I can see that the client usually detects the > stale connection (the closure has been processed by the Duplexer): > > at >> org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170) >> at >> org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.isConnected(PoolingAsyncClientConnectionManager.java:739) >> at >> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.isEndpointConnected(InternalHttpAsyncExecRuntime.java:208) >> at >> org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:157) >> at >> org.apache.hc.client5.http.impl.async.AsyncConnectExec$1.completed(AsyncConnectExec.java:153) >> at >> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:128) >> at >> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime$1.completed(InternalHttpAsyncExecRuntime.java:120) >> at >> org.apache.hc.core5.concurrent.BasicFuture.completed(BasicFuture.java:148) >> at >> org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$3$1.leaseCompleted(PoolingAsyncClientConnectionManager.java:336) > > > But sometimes, the connection closure is not processed in time, in which > case I see this stack trace, where the IOReactor fails the request and > discards the conn pool entry: > > at >> org.apache.hc.core5.pool.PoolEntry.discardConnection(PoolEntry.java:170) >> at >> org.apache.hc.client5.http.impl.nio.PoolingAsyncClientConnectionManager$InternalConnectionEndpoint.close(PoolingAsyncClientConnectionManager.java:724) >> at >> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:148) >> at >> org.apache.hc.client5.http.impl.async.InternalHttpAsyncExecRuntime.discardEndpoint(InternalHttpAsyncExecRuntime.java:180) >> at >> org.apache.hc.client5.http.impl.async.InternalAbstractHttpAsyncClient$2.failed(InternalAbstractHttpAsyncClient.java:363) >> at >> org.apache.hc.client5.http.impl.async.AsyncRedirectExec$1.failed(AsyncRedirectExec.java:261) >> at >> org.apache.hc.client5.http.impl.async.ContentCompressionAsyncExec$1.failed(ContentCompressionAsyncExec.java:160) >> at >> org.apache.hc.client5.http.impl.async.AsyncHttpRequestRetryExec$1.failed(AsyncHttpRequestRetryExec.java:203) >> at >> org.apache.hc.client5.http.impl.async.AsyncProtocolExec$1.failed(AsyncProtocolExec.java:297) >> at >> org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.failed(HttpAsyncMainClientExec.java:135) >> at >> org.apache.hc.core5.http.nio.command.RequestExecutionCommand.failed(RequestExecutionCommand.java:101) >> at >> org.apache.hc.core5.http.nio.command.CommandSupport.cancelCommands(CommandSupport.java:68) >> at >> org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onDisconnect(AbstractHttp1StreamDuplexer.java:415) >> at >> org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.disconnected(AbstractHttp1IOEventHandler.java:95) >> at >> org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.disconnected(ClientHttp1IOEventHandler.java:41) >> at >> org.apache.hc.core5.reactor.ssl.SSLIOSession$1.disconnected(SSLIOSession.java:253) >> at >> org.apache.hc.core5.reactor.InternalDataChannel.disconnected(InternalDataChannel.java:205) >> at >> org.apache.hc.core5.reactor.SingleCoreIOReactor.processClosedSessions(SingleCoreIOReactor.java:229) > > > Sending multiple requests in parallel (with a connection pool size of 1) > greatly exacerbates the race condition: _half_ of all requests fail. The > exceptions vary: RequestNotExecutedException, ConnectionClosedException > ("Connection closed by peer"), and SocketException (connection reset) are > all possible, but what reliably happens is that every other request fails > due to stale (closed) connection reuse. I've reproduced this with HTTP, > TLSv1.2, and TLSv1.3. > > What I'd like to know is: > > 1. Can we do anything to improve this race condition? Is there currently > any blocking work or any async commands that need to run before the conn > pool entry can be discarded, or the endpoint marked as disconnected? > 2. Can we do anything to detect stale connection reuse more reliably when > it does happen? Are there cases where we can surface this as > a RequestNotExecutedException where we currently aren't? > 3. Can the async client implement `validateAfterInactivity` for HTTP/1.1 > requests? I think we could actually implement this with a synchronization > barrier, rather than IO operations: we validate inactive connections by > waiting for the next iteration of the IOReactor event loop. This would > allow any backlog of IO events to be processed, preventing the client from > racing with itself. >