[ 
https://issues.apache.org/jira/browse/IMPALA-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720081#comment-17720081
 ] 

Joe McDonnell commented on IMPALA-12114:
----------------------------------------

Here is what is happening:

Our TSSLSocket is wrapped in a TBufferedTransport. The TBufferedTransport 
implements peek() by calling read() on the underlying TSSLSocket (not peek()).
{noformat}
  bool peek() override {
    if (rBase_ == rBound_) {
      setReadBuffer(rBuf_.get(), transport_->read(rBuf_.get(), rBufSize_));
    }
    return (rBound_ > rBase_);
  }{noformat}
[https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/transport/TBufferTransports.h#L228-L233]

TSSLSocket has a field readRetryCount_. When we call into TSSLSocket::read(), 
either the read succeeds and we zero out readRetryCount_, or it fails and 
readRetryCount_ is bumped. In our case, we hit the timeout, so this is not a 
successful read, and the counter is bumped for each peek() we do on the 
TBufferedTransport.

{noformat}
    bytes = SSL_read(ssl_, buf, len);
    int32_t errno_copy = THRIFT_GET_SOCKET_ERROR;
    int32_t error = SSL_get_error(ssl_, bytes);
    readRetryCount_++;
    if (error == SSL_ERROR_NONE) {
      readRetryCount_ = 0;
      break;
    }{noformat}
[https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L425-L428]
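
The bookkeeping in that snippet can be sketched in isolation. This is a 
simplified model, not Thrift code: RetryCounter, recordAttempt and 
countAfterTimeouts are hypothetical names, and the only behavior taken from 
the snippet is "bump on every attempt, reset on SSL_ERROR_NONE".

```cpp
// Simplified model of the readRetryCount_ bookkeeping; not Thrift code.
// RetryCounter and recordAttempt are hypothetical names for illustration.
struct RetryCounter {
  int readRetryCount = 0;

  // Mirrors the snippet: bump on every attempt, reset only on success.
  // Returns true when the attempt succeeded.
  bool recordAttempt(bool sslErrorNone) {
    readRetryCount++;
    if (sslErrorNone) {
      readRetryCount = 0;
      return true;
    }
    return false;
  }
};

// Counter value after `timeouts` failed attempts in a row. Because the
// counter is a member, it carries over from one read() call to the next.
int countAfterTimeouts(int timeouts) {
  RetryCounter c;
  for (int i = 0; i < timeouts; ++i) c.recordAttempt(false);
  return c.readRetryCount;
}
```

The point of the sketch is that nothing resets the counter between calls 
except a successful read, so timeouts on an idle connection accumulate.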

When readRetryCount_ hits the limit (defaults to 5), we return 0 from the 
TSSLSocket::read() call:
{noformat}
uint32_t TSSLSocket::read(uint8_t* buf, uint32_t len) {
...
  int32_t bytes = 0;
  while (readRetryCount_ < maxRecvRetries_) {
    ... the heart of the read logic, including the maintenance of readRetryCount_ ...
  }
  return bytes;
}{noformat}
[https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L420-L421]

[https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L490-L491]

This causes peek() to return false, because rBase_ == rBound_ and the read 
returned 0 bytes. We then fall out of our loop because peek() returned false.
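
The whole sequence can be modeled with a small simulation. This is a sketch 
under stated assumptions, not the Thrift or Impala code: IdleSslSocket, 
ReadTimeout, peek and idleSecondsUntilDisconnect are hypothetical names, a 
timed-out read is assumed to surface as an exception that the caller treats 
as "keep waiting", and each timed-out peek() is assumed to block for the poll 
period (30 seconds with the --idle_client_poll_period_s=30 setting from the 
reproduction steps).

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-ins; none of these names exist in Thrift or Impala.
struct ReadTimeout : std::runtime_error {
  ReadTimeout() : std::runtime_error("read timed out") {}
};

struct IdleSslSocket {
  int readRetryCount = 0;
  int maxRecvRetries = 5;  // default limit described above

  // Same shape as TSSLSocket::read(): the loop is skipped entirely once the
  // counter reaches the limit, so the call returns 0 bytes.
  uint32_t read() {
    uint32_t bytes = 0;
    while (readRetryCount < maxRecvRetries) {
      readRetryCount++;
      // Assumption: on an idle connection the attempt times out and the
      // timeout surfaces as an exception before the counter is reset.
      throw ReadTimeout();
    }
    return bytes;
  }
};

// Same shape as TBufferedTransport::peek() with an empty read buffer.
bool peek(IdleSslSocket& sock) { return sock.read() > 0; }

// Seconds of idle time before peek() returns false, assuming each timed-out
// peek() blocks for pollPeriodSeconds.
int idleSecondsUntilDisconnect(int pollPeriodSeconds) {
  IdleSslSocket sock;
  int elapsed = 0;
  for (;;) {
    try {
      // peek() never returns true in this model: no data ever arrives.
      if (!peek(sock)) return elapsed;  // the caller closes the connection
    } catch (const ReadTimeout&) {
      elapsed += pollPeriodSeconds;  // timeout: the caller keeps waiting
    }
  }
}
```

Under these assumptions, idleSecondsUntilDisconnect(30) evaluates to 150: 
five timed-out peeks of 30 seconds each exhaust the retry budget, and the 
sixth peek() returns false immediately. That would be consistent with the 
~150 seconds in the issue title.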

> SSL Thrift connections disconnect if idle more than ~150 seconds
> ----------------------------------------------------------------
>
>                 Key: IMPALA-12114
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12114
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>
> A test cluster ran into issues with idle connections being disconnected when 
> using SSL.
> This reproduces on my development environment with these steps:
>  # Start Impala with SSL enabled
> {noformat}
> bin/start-impala-cluster.py 
> --impalad_args="--ssl_client_ca_certificate=${IMPALA_HOME}/be/src/testutil/server-cert.pem
>  --ssl_server_certificate=${IMPALA_HOME}/be/src/testutil/server-cert.pem 
> --ssl_private_key=${IMPALA_HOME}/be/src/testutil/server-key.pem 
> --hostname=localhost --idle_client_poll_period_s=30 -v=2" 
> --state_store_args="--ssl_client_ca_certificate=${IMPALA_HOME}/be/src/testutil/server-cert.pem
>  --ssl_server_certificate=${IMPALA_HOME}/be/src/testutil/server-cert.pem 
> --ssl_private_key=${IMPALA_HOME}/be/src/testutil/server-key.pem 
> --hostname=localhost" 
> --catalogd_args="--ssl_client_ca_certificate=${IMPALA_HOME}/be/src/testutil/server-cert.pem
>  --ssl_server_certificate=${IMPALA_HOME}/be/src/testutil/server-cert.pem 
> --ssl_private_key=${IMPALA_HOME}/be/src/testutil/server-key.pem 
> --hostname=localhost" --cluster_size=1{noformat}
>  # Connect with impala-shell
> {noformat}
> impala-shell --ssl{noformat}
>  # Leave this idle for 150+ seconds
> The impalad logs will contain a statement like this:
> {noformat}
> I0503 22:11:53.233147 206554 impala-server.cc:2488] Connection 
> 20470cb275a1d256:3d68601942f3179f from client 172.27.100.70:42540 to server 
> hiveserver2-frontend closed. The connection had 2 associated 
> session(s).{noformat}
>  # Run a statement in impala-shell; it will show that it needs to reconnect
> {noformat}
> default> show tables;
> Caught exception TSocket read 0 bytes, type=<class 
> 'thrift.transport.TTransport.TTransportException'> in PingImpalaHS2Service. 
> Caught exception [Errno 32] Broken pipe, type=<class 'socket.error'> in 
> CloseSession. 
> Warning: close session RPC failed: [Errno 32] Broken pipe, <class 
> 'socket.error'>
> Connection lost, reconnecting...
> ... then it retries and succeeds{noformat}
> Tracing through the code, it appears that this peek() call returns false:
> {noformat}
>       try {
>         bytes_pending = input_->getTransport()->peek();
>         break;
>       } catch (const TTransportException& ttx) {{noformat}
> bytes_pending is false, and this causes the connection to be closed.
> This doesn't seem to impact Impala with older Thrift versions, so maybe 
> something changed in Thrift 0.16.



