On Fri, 7 Nov 2025 18:14:38 GMT, Mark Sheppard <[email protected]> wrote:

>> Nothing looks wrong with the test or the code. The failure happens rarely -
>> probably when the machine is under load: this test tries to saturate the
>> socket buffers and is resource consuming.
>>
>> The proposed fix is just to double the jtreg timeout for this test from 120
>> to 240.
>
> It is fine to increase the jtreg test timeout. BUT with the TIMEOUT FACTOR
> reverted to 4 and hence an overall jtreg timeout 8 minutes, the most likely
> outcome of increasing the test's explicit timeout to 240 is for the test to
> timeout after 16 minutes, when this "moribund" condition arises. Take into
> account that a typical execution time in CI is < 20 seconds
> And is < 4 seconds on laptop.
>
> The process capture suggests that the writer and reader in the readSlowly and
> writeSlowly have got stuck
> It's a complex enough test with the HttpClient also stuck in a "poll" waiting
> on network events
> I haven't study the test in depth, but it smells like a race
> condition/conditions

Taking into account the observations from @msheppar, I have revised the fix. The more important change is that I have inverted the order in which the RawChannel.Events were registered, so that the ReadEvent (client side) is now registered before the WriteEvent (client side). The suspicion here is that the handling of the WriteEvent is hogging the CPU (since there is always something to write), so the read event might not get registered until the test times out. That may or may not be the reason for the observed test failures, but comparing the logs of the runs that pass with the logs of the runs that fail shows that in the failing case the test is busy writing and the read event is not being fired.

Another possibility is that some kind of failure happened in one of the threads, so the fix also adds more logging and improves failure reporting, so that we can better diagnose the issue if the test fails again.

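For reference, the timeout arithmetic quoted above can be sketched as follows. This is only an illustration, assuming that jtreg scales a test's declared timeout (in seconds) by the configured timeout factor; the class and method names are hypothetical and are not part of jtreg or of the test.

```java
import java.time.Duration;

// Hypothetical helper, for illustration only: not part of jtreg or the test.
public class JtregTimeoutSketch {

    // Assumption: effective timeout = declared timeout (seconds) * timeout factor.
    static Duration effectiveTimeout(int declaredTimeoutSeconds, double timeoutFactor) {
        long millis = Math.round(declaredTimeoutSeconds * timeoutFactor * 1000);
        return Duration.ofMillis(millis);
    }

    public static void main(String[] args) {
        double factor = 4.0; // the timeout factor mentioned in the review comment

        // Timeout of 120 seconds, as before the originally proposed change
        System.out.println("timeout=120: "
                + effectiveTimeout(120, factor).toMinutes() + " minutes");

        // Timeout of 240 seconds, as originally proposed
        System.out.println("timeout=240: "
                + effectiveTimeout(240, factor).toMinutes() + " minutes");
    }
}
```

With a factor of 4 this gives 8 and 16 minutes respectively, which is why simply doubling the timeout would mostly delay the failure report when the test is genuinely stuck.
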
Repeated testing has not seen the test fail again with those changes, but since the failure was happening very rarely in the first place that might not mean anything.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28178#issuecomment-3517142365
