On Fri, 7 Nov 2025 18:14:38 GMT, Mark Sheppard <[email protected]> wrote:

>> Nothing looks wrong with the test or the code. The failure happens rarely - 
>> probably when the machine is under load: this test tries to saturate the 
>> socket buffers and is resource consuming.
>> 
>> The proposed fix is just to double the jtreg timeout for this test from 120 
>> to 240.
>
> It is fine to increase the jtreg test timeout. BUT with the TIMEOUT FACTOR 
> reverted to 4, and hence an overall jtreg timeout of 8 minutes, the most 
> likely outcome of increasing the test's explicit timeout to 240 is for the 
> test to time out after 16 minutes when this "moribund" condition arises. Take 
> into account that a typical execution time in CI is < 20 seconds, 
> and < 4 seconds on a laptop.
> 
> The process capture suggests that the writer and reader in readSlowly and 
> writeSlowly have got stuck.
> It's a complex enough test, with the HttpClient also stuck in a "poll" waiting 
> on network events.
> I haven't studied the test in depth, but it smells like a race 
> condition/conditions.
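
For reference, the timeout arithmetic in the quoted comment can be sketched as below. This assumes the effective jtreg timeout is simply the test's explicit timeout (in seconds) multiplied by the timeout factor. The class and method names are hypothetical and are not part of jtreg.

```java
import java.time.Duration;

public class JtregTimeoutSketch {

    // Assumption: effective timeout = explicit timeout (seconds) * timeout factor.
    static Duration effectiveTimeout(int explicitSeconds, int timeoutFactor) {
        return Duration.ofSeconds((long) explicitSeconds * timeoutFactor);
    }

    public static void main(String[] args) {
        // Explicit timeout of 120 with a factor of 4: 480 seconds, i.e. 8 minutes.
        System.out.println(effectiveTimeout(120, 4).toMinutes());
        // Doubling the explicit timeout to 240: 960 seconds, i.e. 16 minutes.
        System.out.println(effectiveTimeout(240, 4).toMinutes());
    }
}
```

So a hung run that previously took 8 minutes to be reported would take 16 minutes with the doubled timeout.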

Taking into account the observations from @msheppar, I have revised the fix. The 
more important change is that I have inverted the order in which the 
RawChannel.Events are registered, so that the ReadEvent (client side) is now 
registered before the WriteEvent (client side). The suspicion here is that the 
handling of the WriteEvent is hogging the CPU (since there is always something 
to write), so the read event might not get registered until the test times out. 
That may or may not be the reason for the observed test failures, but 
comparing the logs of passing runs with the logs of failing runs shows that in 
the failing case the test is busy writing and the read event is not being 
fired.
Another possibility is that some kind of failure happened in one of the 
threads, so the fix also adds more logging and improves failure reporting so 
that we can better diagnose the issue if the test fails again.
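
To illustrate the suspicion about registration order, here is a deliberately simplified toy model. It is not the test code and does not use the RawChannel API. All names are hypothetical, and the scheduling rules (registrations are only processed on idle ticks, data to read arrives only on every third tick, there is always something to write) are assumptions made purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RegistrationOrderSketch {

    // Runs a toy single-threaded loop for `budget` ticks (a stand-in for the
    // test timeout) and returns the events that were handled, in order.
    static List<String> run(List<String> registrationOrder, int budget) {
        Deque<String> pending = new ArrayDeque<>(registrationOrder);
        boolean readRegistered = false;
        boolean writeRegistered = false;
        List<String> handled = new ArrayList<>();

        for (int tick = 0; tick < budget; tick++) {
            boolean busy = false;
            // Assumption: incoming data is only available on every third tick.
            if (readRegistered && tick % 3 == 0) {
                handled.add("read");
                busy = true;
            }
            // Assumption: there is always something to write.
            if (writeRegistered) {
                handled.add("write");
                busy = true;
            }
            // Assumption: pending registrations are only processed on idle ticks.
            if (!busy && !pending.isEmpty()) {
                String event = pending.poll();
                if (event.equals("read")) {
                    readRegistered = true;
                } else {
                    writeRegistered = true;
                }
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        // Write registered first: the loop is always busy writing afterwards,
        // so in this model the read event is never registered or handled.
        System.out.println(run(List.of("write", "read"), 10).contains("read"));
        // Read registered first: both events end up registered and handled.
        System.out.println(run(List.of("read", "write"), 10).contains("read"));
    }
}
```

In this model only the read-first order ever handles a read within the budget, which matches the shape of what the failing logs show. It says nothing about whether the real scheduler behaves this way.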

Repeated testing has not seen the test fail again with those changes, but since 
the failure was happening very rarely in the first place that might not mean 
anything.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28178#issuecomment-3517142365
