[ https://issues.apache.org/jira/browse/IMPALA-11674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619734#comment-17619734 ]
Wenzhe Zhou edited comment on IMPALA-11674 at 10/18/22 6:59 PM:
----------------------------------------------------------------
TSSLSocket::peek() in TSSLSocket.cpp was changed to call
TSSLSocket::waitForEvent(). When TSSLSocket::waitForEvent() calls THRIFT_POLL
(which is the poll() function on Linux) with a positive timeout value,
THRIFT_POLL returns 0 if the call times out
(https://linux.die.net/man/2/poll). In that case TSSLSocket::waitForEvent()
throws the exception
TTransportException(TTransportException::TIMED_OUT, "THRIFT_POLL (timed out)").
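The poll() behavior described above can be reproduced without Thrift. The
following is only a minimal sketch, assuming a POSIX system: TimedOutError,
WaitForReadable() and IdlePipeTimesOut() are hypothetical names made up for
illustration and are not part of Thrift or Impala. It polls the read end of an
idle pipe and turns a return value of 0 into an exception carrying the same
message text.

```cpp
#include <poll.h>
#include <unistd.h>

#include <stdexcept>
#include <string>

// Hypothetical stand-in for Thrift's TTransportException with type TIMED_OUT.
class TimedOutError : public std::runtime_error {
 public:
  explicit TimedOutError(const std::string& msg) : std::runtime_error(msg) {}
};

// Waits until fd becomes readable. poll() returns 0 when the timeout expires
// before any event arrives; that case is reported as TimedOutError.
void WaitForReadable(int fd, int timeout_ms) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;
  int ret = poll(&pfd, 1, timeout_ms);
  if (ret < 0) throw std::runtime_error("poll() failed");
  if (ret == 0) throw TimedOutError("THRIFT_POLL (timed out)");
}

// Returns true if polling an idle pipe for 50 ms ends in a timeout.
bool IdlePipeTimesOut() {
  int fds[2];
  if (pipe(fds) != 0) return false;
  bool timed_out = false;
  try {
    WaitForReadable(fds[0], 50);  // nothing is ever written to the pipe
  } catch (const TimedOutError&) {
    timed_out = true;
  }
  close(fds[0]);
  close(fds[1]);
  return timed_out;
}
```

Calling IdlePipeTimesOut() should report a timeout, because no data is ever
written to the pipe before the 50 ms limit expires.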
IsReadTimeoutTException() and IsPeekTimeoutTException() should be updated to
check for this new exception. Otherwise they return the wrong value for a
timeout, which causes TAcceptQueueServer::Peek() to rethrow the exception to
its caller TAcceptQueueServer::run(). TAcceptQueueServer::run() then writes the
log message "AcceptQueueServer client died: THRIFT_POLL (timed out)" and closes
the connection.
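A rough sketch of the kind of check this implies is shown below. It is not the
actual code in be/src/rpc/thrift-util.cc: the TTransportException class here
is a simplified stand-in for the Thrift one, and IsPollTimeout() and
PeekIgnoringTimeout() are hypothetical names. The only details taken from the
analysis above are the TIMED_OUT type and the "THRIFT_POLL (timed out)"
message.

```cpp
#include <cstring>
#include <stdexcept>
#include <string>

// Simplified stand-in for apache::thrift::transport::TTransportException.
class TTransportException : public std::runtime_error {
 public:
  enum Type { UNKNOWN, TIMED_OUT };
  TTransportException(Type type, const std::string& msg)
      : std::runtime_error(msg), type_(type) {}
  Type getType() const { return type_; }

 private:
  Type type_;
};

// Hypothetical check: true only for the exception thrown on a poll timeout.
bool IsPollTimeout(const TTransportException& e) {
  return e.getType() == TTransportException::TIMED_OUT &&
         std::strstr(e.what(), "THRIFT_POLL (timed out)") != nullptr;
}

// Hypothetical peek wrapper: a timeout means "no data yet" and the connection
// stays open; any other transport error is rethrown to the caller.
template <typename PeekFn>
bool PeekIgnoringTimeout(PeekFn peek) {
  try {
    return peek();
  } catch (const TTransportException& e) {
    if (IsPollTimeout(e)) return false;
    throw;
  }
}
```

With a check like this in place, a timed-out peek would be treated as an idle
client rather than a dead one, so the caller would have no reason to close the
connection.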
In one reported case, client Thrift connections were closed after 30 seconds,
with many "AcceptQueueServer client died: THRIFT_POLL (timed out)" messages in
the coordinator log file. This behavior matches the code analysis above.
[~rizaon] Please verify that my code analysis makes sense. I think we have the
same issue with Thrift 0.11.0.
cc: [~joemcdonnell]
> Fix IsPeekTimeoutTException and IsReadTimeoutTException for thrift-0.16.0
> -------------------------------------------------------------------------
>
> Key: IMPALA-11674
> URL: https://issues.apache.org/jira/browse/IMPALA-11674
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.2.0
> Reporter: Wenzhe Zhou
> Assignee: Riza Suminto
> Priority: Major
>
> IMPALA-7825 upgraded Thrift from 0.9.3 to 0.11.0, and IMPALA-11384
> upgraded the CPP Thrift components from 0.11.0 to 0.16.0.
> The functions IsPeekTimeoutTException() and IsReadTimeoutTException() in
> be/src/rpc/thrift-util.cc make assumptions about the implementation of
> read(), peek(), write() and write_partial() in TSocket.cpp and
> TSSLSocket.cpp. The functions read() and peek() in TSSLSocket.cpp were
> changed in versions 0.11.0 and 0.16.0 to throw a different exception on
> timeout. This causes IsPeekTimeoutTException() and
> IsReadTimeoutTException() to return the wrong value after the Thrift
> upgrade, which in turn causes TAcceptQueueServer::Peek() to rethrow the
> exception to its caller TAcceptQueueServer::run(), which then closes the
> connection.