Csaba Ringhofer created THRIFT-5846:
---------------------------------------

             Summary: Hang when closing TBufferedTransport + TSslTransport
                 Key: THRIFT-5846
                 URL: https://issues.apache.org/jira/browse/THRIFT-5846
             Project: Thrift
          Issue Type: Bug
          Components: C++ - Library
            Reporter: Csaba Ringhofer


The issue occurred in Impala tests when switching to OpenSsl 3.2 (IMPALA-13680).
While the root cause of the regression may be some change in OpenSsl, Thrift's 
behavior seems also strange as it tries to do SSL handshake when closing the 
transport:
{code}
apache::thrift::transport::TSSLSocket::waitForEvent(bool)  [TSSLSocket.cpp : 
881 + 0xa]
apache::thrift::transport::TSSLSocket::initializeHandshake() [TSSLSocket.cpp : 
683 + 0x12]
apache::thrift::transport::TSSLSocket::flush() [TSSLSocket.cpp : 613 + 0x5]
apache::thrift::transport::TBufferedTransport::close() [TBufferTransports.cpp : 
133 + 0x3]
apache::thrift::server::TAcceptQueueServer::Task::run() [TAcceptQueueServer.cpp 
: 111 + 0x3]
{code}
Note that Impala uses Thrift 0.16.0 with a few patches so line numbers may not 
match with the original Thrift source code, but the Impala changes do not seem 
relevant in this case.

The timeline of the events is:
1. TSSLSocket wrapped in TBufferedTransport is created and opened
2. the first TSSLSocket::initializeHandshake() fails and leads to throwing an 
TTransportException with "SSL_accept: wrong version number (SSL_error_code = 1)"
3. Impala tries to close the transport
4. as the transport is TBufferedTransport, it calls flush on close() 
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TBufferTransports.h#L236
5. TBufferedTransport::flush() calls the underlying transport's flush() 
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TBufferTransports.cpp#L134
6. TSSlSocket::flush() calls initializeHandshake() 
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L586

The hang only happens with the "wrong version number" error, other SSL_accept 
errors were handled as before (e.g. "unsupported protocol").

My expectation is that if the transport's open() was successful then calling 
close() should be safe, even if a later function throws an exception. In case 
of TSSLSocket open() just opens the socket and the first read()/write() will do 
the actual SSL handshake. If the handshake fails then the TSSLSocket enters a 
strange state where isOpen() returns true but the ssl_ object is not usable.

I think that a possible solution is to shutdown+free ssl_ if the handshake 
failed:
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L698
This means that isOpen() would return false after the error + 
TSSLSocket::flush() would not try to reestablish connection. It would also make 
sense to skip flushing in TBufferedTransport::close() if the underlying 
transport is not open.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to