Csaba Ringhofer created THRIFT-5846: ---------------------------------------
Summary: Hang when closing TBufferedTransport + TSslTransport Key: THRIFT-5846 URL: https://issues.apache.org/jira/browse/THRIFT-5846 Project: Thrift Issue Type: Bug Components: C++ - Library Reporter: Csaba Ringhofer The issue occurred in Impala tests when switching to OpenSsl 3.2 (IMPALA-13680). While the root cause of the regression may be some change in OpenSsl, Thrift's behavior seems also strange as it tries to do SSL handshake when closing the transport: {code} apache::thrift::transport::TSSLSocket::waitForEvent(bool) [TSSLSocket.cpp : 881 + 0xa] apache::thrift::transport::TSSLSocket::initializeHandshake() [TSSLSocket.cpp : 683 + 0x12] apache::thrift::transport::TSSLSocket::flush() [TSSLSocket.cpp : 613 + 0x5] apache::thrift::transport::TBufferedTransport::close() [TBufferTransports.cpp : 133 + 0x3] apache::thrift::server::TAcceptQueueServer::Task::run() [TAcceptQueueServer.cpp : 111 + 0x3] {code} Note that Impala uses Thrift 0.16.0 with a few patches so line numbers may not match with the original Thrift source code, but the Impala changes do not seem relevant in this case. The timeline of the events is: 1. TSSLSocket wrapped in TBufferedTransport is created and opened 2. the first TSSLSocket::initializeHandshake() fails and leads to throwing an TTransportException with "SSL_accept: wrong version number (SSL_error_code = 1)" 3. Impala tries to close the transport 4. as the transport is TBufferedTransport, it calls flush on close() https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TBufferTransports.h#L236 5. TBufferedTransport::flush() calls the underlying transport's flush() https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TBufferTransports.cpp#L134 6. TSSlSocket::flush() calls initializeHandshake() https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L586 The hang only happens with the "wrong version number" error, other SSL_accept errors were handled as before (e.g. "unsupported protocol"). My expectation is that if the transport's open() was successful then calling close() should be safe, even if a later function throws an exception. In case of TSSLSocket open() just opens the socket and the first read()/write() will do the actual SSL handshake. If the handshake fails then the TSSLSocket enters a strange state where isOpen() returns true but the ssl_ object is not usable. I think that a possible solution is to shutdown+free ssl_ if the handshake failed: https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L698 This means that isOpen() would return false after the error + TSSLSocket::flush() would not try to reestablish connection. It would also make sense to skip flushing in TBufferedTransport::close() if the underlying transport is not open. -- This message was sent by Atlassian Jira (v8.20.10#820010)