gaurav-narula opened a new pull request, #16354:
URL: https://github.com/apache/kafka/pull/16354

   We observed that some runs of the test suite caused CI pipelines to stall.
   
   A thread dump revealed that the test runner was blocked reading from a 
socket while attempting to close that socket [[0]]. The cause is a JDK bug 
very similar to 
[JDK-8274524](https://bugs.openjdk.org/browse/JDK-8274524), except that it 
affects the else branch of `SSLSocketImpl::bruteForceCloseInput` [[1]], which 
JDK-8274524 did not fix.
   
   Because the read blocks in a native call, the blocked test runner thread 
does not appear to respond to interrupts, so the test runner's timeouts have 
no effect.
   
   As a mitigation in Kafka's test suite, this change sets an `SO_TIMEOUT` of 
30 seconds on all TLS sockets handled by `EchoServer`. The timeout is generous 
for tests, and a finite upper bound prevents the test suite from blocking 
indefinitely.
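   
   For illustration only, here is a minimal sketch of how `SO_TIMEOUT` bounds a 
blocking read. It is not the code in this PR: it uses a plain TCP socket on the 
loopback interface rather than a TLS socket, the class and method names 
(`SoTimeoutSketch`, `readTimesOut`) are hypothetical, and the 200 ms timeout is 
chosen only to keep the demo short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutSketch {

    /**
     * Accepts one connection from a peer that never writes, then attempts a
     * read with the given SO_TIMEOUT (in milliseconds). Returns true if the
     * read gave up with SocketTimeoutException instead of blocking forever.
     */
    static boolean readTimesOut(int soTimeoutMs) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // Without this call the read below would block until the peer
            // writes or closes; with it, the read is bounded.
            accepted.setSoTimeout(soTimeoutMs);
            try {
                accepted.getInputStream().read(); // the client never writes
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

   Since `SSLSocket` extends `Socket`, the same `setSoTimeout` call is available 
on TLS sockets. Whether the timeout interrupts the specific blocked read in 
`bruteForceCloseInput` depends on the JDK behavior described above.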
   
   [0]: https://issues.apache.org/jira/secure/attachment/13066427/timeout.log
   [1]: 
https://github.com/openjdk/jdk/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/sun/security/ssl/SSLSocketImpl.java#L808
   
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   

