https://bz.apache.org/bugzilla/show_bug.cgi?id=68884

            Bug ID: 68884
           Summary: Delayed HTTP Traffic Processing After Mass Websocket
                    Disconnect/Reconnect
           Product: Tomcat 9
           Version: 9.0.75
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: WebSocket
          Assignee: dev@tomcat.apache.org
          Reporter: inconceiva...@gmail.com
  Target Milestone: -----

Apache Tomcat Bug Report
Delayed HTTP Traffic Processing After Mass Websocket Disconnect/Reconnect

Description:

Normal HTTP traffic processing takes 10+ minutes to resume after a mass
websocket disconnect/reconnect event. The issue arises when a network
interruption or stop-the-world garbage collection pause lasts longer than the
maxIdleTimeout (35 seconds), causing numerous websocket sessions to close at
once.

With several thousand websocket sessions closing simultaneously, all available
nio2 threads (maxThreads=50) become occupied with the closure process. These
threads enter a continuous loop, repeatedly calling Thread.yield while waiting
to acquire the WsRemoteEndpointImplBase messagePartInProgress semaphore. This
behavior, introduced as part of the fix for BZ66508, allows closing threads to
relinquish CPU time while waiting for the send semaphore (up to the default
20-second timeout).

java.base@11.0.21/java.lang.Thread.yield(Native Method)
org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.acquireMessagePartInProgressSemaphore(WsRemoteEndpointImplServer.java:130)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:292)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:256)
org.apache.tomcat.websocket.WsSession.sendCloseMessage(WsSession.java:801)
org.apache.tomcat.websocket.WsSession.onClose(WsSession.java:711)

Observations indicate that on Linux, Thread.yield places the thread at a lower
priority in the CPU scheduling queue, resulting in a prolonged series of yield
calls until the timeout is reached and a SocketTimeoutException is triggered.
HTTP traffic processing remains stalled until all session closures are
completed.
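
For illustration only, the waiting pattern described above can be sketched as
follows. This is not the Tomcat source: the class and method names
(CloseYieldLoopSketch, acquireWithYield) are hypothetical, and the sketch only
assumes that the closing thread polls a semaphore without blocking and calls
Thread.yield between attempts until a deadline passes.

```java
import java.util.concurrent.Semaphore;

public class CloseYieldLoopSketch {

    /**
     * Polls the semaphore without blocking and yields between attempts.
     * Returns true if the permit was acquired, false once timeoutMillis
     * has elapsed (hypothetical stand-in for the timeout path).
     */
    public static boolean acquireWithYield(Semaphore semaphore, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!semaphore.tryAcquire()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out; the thread was occupied the whole time
            }
            Thread.yield(); // gives up the CPU but keeps the pool thread busy
        }
        return true;
    }

    public static void main(String[] args) {
        // Zero permits simulates a send that never releases the semaphore.
        Semaphore held = new Semaphore(0);
        long start = System.nanoTime();
        boolean acquired = acquireWithYield(held, 200);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("acquired=" + acquired + " after ~" + elapsedMillis + " ms");
    }
}
```

The point of the sketch is that the thread never parks: for the whole timeout
it stays runnable and unavailable to the pool, which matches the stalled HTTP
processing we observe.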

We have implemented a temporary solution by introducing a property to limit the
time spent in the on-close yield loop. Reducing this value from the default
significantly improves recovery time. Additionally, decreasing maxThreads
appears to further extend the recovery time, although the exact relationship
requires further investigation.

Reproducing the Issue:

The issue, initially identified in a scenario with 50 threads and 5000 maximum
websocket connections, can also be reproduced at a smaller scale with varying
thread and session counts.

1. Establish several thousand websocket connections that periodically
send/receive data to simulate traffic.
2. Induce a JVM pause or network interruption lasting 40 seconds or more.
3. Restore client-side connectivity.
4. Start a timer and attempt to obtain a 200 response from the server.
5. Stop the timer once a successful response is received.
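
Steps 4 and 5 can be automated with a small polling client. The following is a
minimal sketch, assuming Java 11+ (java.net.http) and a plain HTTP endpoint on
the server that returns 200 when healthy; the class name RecoveryTimer, the
2-second per-request timeout and the 250 ms polling interval are our own
illustrative choices, not part of the original test setup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RecoveryTimer {

    /**
     * Polls the URI until it answers 200 and returns the elapsed time in
     * milliseconds, or -1 if maxWait elapses (or the thread is interrupted).
     */
    public static long measureRecoveryMillis(URI uri, Duration maxWait) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        long start = System.nanoTime();
        while (System.nanoTime() - start < maxWait.toNanos()) {
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() == 200) {
                    return (System.nanoTime() - start) / 1_000_000L;
                }
                Thread.sleep(250);
            } catch (IOException e) {
                // Refused or timed out: the server is still stalled, keep polling.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java RecoveryTimer <url>");
            return;
        }
        long millis = measureRecoveryMillis(URI.create(args[0]), Duration.ofMinutes(15));
        System.out.println(millis < 0 ? "no 200 within limit" : "recovered after " + millis + " ms");
    }
}
```

Start it at the moment client-side connectivity is restored; the printed value
corresponds to one entry in the recovery-time tables.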

Test Configurations and Results:

5 nio2 threads, 300 websocket connections:

Close Timeout   Recovery Times (seconds)
10s             218, 300, 159, 168, 312
 5s             60, 42, 102, 199, 160
 2s             27, 30, 42, 19, 18
 1s             13, 15, 15

15 nio2 threads, 300 websocket connections:

Close Timeout   Recovery Times (seconds)
2s              11, 8, 7, 6, 7, 12
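
As a rough sanity check on these numbers, one can compute a pessimistic bound
under the assumption (ours, not verified against the Tomcat code) that every
closing session occupies one pool thread for the full close timeout and that
nothing else delays recovery. The helper below is a hypothetical back-of-the-
envelope calculation, not a model of the actual scheduler behavior.

```java
public class RecoveryBound {

    /**
     * Pessimistic bound: sessions are closed in "rounds" of one per thread,
     * and each round is assumed to last the full close timeout.
     */
    public static long upperBoundSeconds(int sessions, int threads, int closeTimeoutSeconds) {
        long rounds = (sessions + threads - 1) / threads; // ceiling division
        return rounds * closeTimeoutSeconds;
    }

    public static void main(String[] args) {
        System.out.println(upperBoundSeconds(300, 5, 10)); // prints 600
        System.out.println(upperBoundSeconds(300, 5, 2));  // prints 120
        System.out.println(upperBoundSeconds(300, 15, 2)); // prints 40
    }
}
```

All measured recovery times above fall below the corresponding bound (for
example 159 to 312 seconds against 600 for 5 threads and a 10s timeout), which
is consistent with only part of the sessions waiting for the full timeout.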

Observations:

The issue was initially observed with Tomcat 9.0.75 (embedded) and remains
reproducible with versions up to 9.0.82 (embedded), even with the 9.0.86 fix
for reentrant lock on close handling applied. While the 9.0.86 fix resolved a
memory leak, it did not alleviate the extended recovery times.

Proposed Solution:

Introducing a separate property specifically for the on-close send timeout
would allow for finer-grained control and optimization of session closure
behavior, particularly for servers operating with fixed thread pool sizes.
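
One possible shape for such a property is sketched below. The key
example.CLOSE_SEND_TIMEOUT is purely hypothetical and does not exist in
Tomcat; org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT is the existing
session user property for the blocking send timeout (20 seconds by default).
The sketch assumes the close path would consult the new key first and fall
back to the existing behavior when it is absent.

```java
import java.util.HashMap;
import java.util.Map;

public class CloseTimeoutSketch {

    // Existing Tomcat user property for blocking sends.
    static final String BLOCKING_SEND_TIMEOUT =
            "org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT";
    // Hypothetical key for the proposed on-close timeout (illustration only).
    static final String CLOSE_SEND_TIMEOUT = "example.CLOSE_SEND_TIMEOUT";
    static final long DEFAULT_MILLIS = 20_000L;

    /** Resolves the timeout for sending the close message, in milliseconds. */
    public static long closeSendTimeoutMillis(Map<String, Object> userProperties) {
        Object close = userProperties.get(CLOSE_SEND_TIMEOUT);
        if (close instanceof Number) {
            return ((Number) close).longValue();
        }
        Object send = userProperties.get(BLOCKING_SEND_TIMEOUT);
        if (send instanceof Number) {
            return ((Number) send).longValue();
        }
        return DEFAULT_MILLIS;
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        System.out.println(closeSendTimeoutMillis(props)); // prints 20000
        props.put(CLOSE_SEND_TIMEOUT, 2_000L);
        System.out.println(closeSendTimeoutMillis(props)); // prints 2000
    }
}
```

Keeping the fallback to the blocking send timeout means deployments that do
not set the new property would see no change in behavior.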

Additional Notes:

While BZ66508 removed the fixed timeout for on-close acquisition, the potential
for a 20-second wait during semaphore acquisition persists, leading to
prolonged session closure times and increased overhead on the OS scheduler due
to the repeated yield calls.

We are investigating the precise relationship between thread count and recovery
time and will provide additional data as it becomes available.

We believe that implementing the proposed solution would significantly improve
Tomcat's performance under these conditions and provide administrators with
greater control over resource utilization during mass websocket disconnect
events.

