[
https://issues.apache.org/jira/browse/IMPALA-11263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526742#comment-17526742
]
Wenzhe Zhou edited comment on IMPALA-11263 at 4/24/22 8:34 AM:
---------------------------------------------------------------
In function Coordinator::BackendState::Cancel(), if exec_rpc_sent_ is true and
exec_done_ is false (meaning Coordinator::BackendState::ExecAsync() has been
called, but the callback function Coordinator::BackendState::ExecCompleteCb()
has not), we call RpcController::Cancel() to cancel the Exec() RPC, then call
Coordinator::BackendState::WaitOnExecLocked() to wait for the RPC callback
function Coordinator::BackendState::ExecCompleteCb() to be called.
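The decision described above can be sketched as a small model. This is only
an illustration under stated assumptions, not Impala code: BackendStateModel
and cancel_rpc_issued are hypothetical names, the real Cancel() does more
than this, and the boolean return value merely stands in for "the caller
would now block in WaitOnExecLocked()".

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for Coordinator::BackendState.
struct BackendStateModel {
  bool exec_rpc_sent = false;      // set once ExecAsync() has issued the RPC
  bool exec_done = false;          // set when ExecCompleteCb() has run
  bool cancel_rpc_issued = false;  // stands in for RpcController::Cancel()

  void ExecAsync() { exec_rpc_sent = true; }

  void ExecCompleteCb() {
    assert(exec_rpc_sent);  // the callback only makes sense after the RPC was sent
    exec_done = true;
  }

  // Returns true if the caller would have to block until ExecCompleteCb()
  // runs, i.e. the point where the real code calls WaitOnExecLocked().
  bool Cancel() {
    if (exec_rpc_sent && !exec_done) {
      cancel_rpc_issued = true;
      return true;
    }
    return false;
  }
};
```

In this model Cancel() returns true only in the window between ExecAsync()
and ExecCompleteCb(). If the callback never arrives, that is exactly the
window in which the real wait never returns.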
From the above log message, Coordinator::BackendState::Cancel() for the 4th
backend hung after calling WaitOnExecLocked(). That means the RPC callback
function was not called even after the timeout.
Coordinator::BackendState::ExecCompleteCb() should be called whenever the RPC
finishes successfully, finishes with an error, is cancelled, or times out.
After reading the KRPC code, I found that the KRPC callback function may never
be called in one corner case: the RPC is cancelled while it is in SENDING
state and the socket write then fails.
The Impala coordinator calls RpcController::Cancel() to schedule an RPC
cancellation task on the reactor thread pool. When a reactor thread executes
the cancellation task in ReactorThread::CancelOutboundCall(), that function
calls Connection::CancelOutboundCall() and then OutboundCall::Cancel().
Connection::CancelOutboundCall() resets car->call to a null pointer, which
causes Connection::HandleOutboundCallTimeout() to skip calling
OutboundCall::SetTimedOut(). OutboundCall::Cancel() does not call
OutboundCall::SetCancelled() if the OutboundCall is in SENDING state. In that
case OutboundCall::SetCancelled() is not called until OutboundCall::SetSent()
is called, when the state moves from SENDING to SENT. So once an RPC is
cancelled, OutboundCall::SetTimedOut() will not be called for its OutboundCall
when the timeout is handled in Connection::HandleOutboundCallTimeout(), and,
if the OutboundCall is in SENDING state, OutboundCall::SetCancelled() will not
be called until OutboundCall::SetSent() is called.
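The cancellation behaviour described above can be modelled in a few lines.
This is a minimal sketch based only on the description in this comment, not a
copy of the KRPC sources: the type names ending in Model, the reduced state
set, the callbacks_run counter and ReactorCancelOutboundCall() are
assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical, reduced set of call states; real KRPC has more.
enum class CallState { kOnOutboundQueue, kSending, kSent, kCancelled, kTimedOut };

struct OutboundCallModel {
  CallState state = CallState::kOnOutboundQueue;
  bool cancellation_requested = false;
  int callbacks_run = 0;  // how often the user callback would have fired

  bool IsFinished() const {
    return state == CallState::kCancelled || state == CallState::kTimedOut;
  }
  void SetSending() { state = CallState::kSending; }
  void SetSent() {
    assert(state == CallState::kSending);
    state = CallState::kSent;
    // A cancellation that was deferred while SENDING is completed here.
    if (cancellation_requested) SetCancelled();
  }
  void SetCancelled() { Finish(CallState::kCancelled); }
  void SetTimedOut() { Finish(CallState::kTimedOut); }
  void Cancel() {
    cancellation_requested = true;
    if (state == CallState::kSending) return;  // deferred until SetSent()
    SetCancelled();
  }

 private:
  void Finish(CallState terminal) {
    if (IsFinished()) return;  // the callback runs at most once
    state = terminal;
    ++callbacks_run;
  }
};

// Stand-in for the connection's per-call record ("car" in this comment).
struct CallAwaitingResponseModel {
  OutboundCallModel* call = nullptr;
};

// Stands in for ReactorThread::CancelOutboundCall(): the connection drops its
// reference to the call first, then the call itself is cancelled.
void ReactorCancelOutboundCall(CallAwaitingResponseModel& car,
                               OutboundCallModel& call) {
  car.call = nullptr;
  call.Cancel();
}

// Stands in for Connection::HandleOutboundCallTimeout(): once car.call is
// null there is nothing left to time out.
void HandleOutboundCallTimeout(CallAwaitingResponseModel& car) {
  if (car.call == nullptr) return;
  car.call->SetTimedOut();
}
```

In this model, a call cancelled while in kSending runs no callback, and a
later HandleOutboundCallTimeout() does nothing either. The callback runs only
once SetSent() is called.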
OutboundCall::SetSent() is called by function
CallTransferCallbacks::NotifyTransferFinished()
when the notification that the transfer has finished is received after the RPC
call has been sent on the wire.
Connection::ProcessOutboundTransfers() calls OutboundCall::SetSending() to set
the OutboundCall's state to SENDING when it starts transferring the RPC. It
then calls OutboundTransfer::SendBuffer() to send the data through the socket.
OutboundTransfer::SendBuffer() calls socket->Writev() to send the data. If
socket->Writev() returns an error, SendBuffer() returns the error without
calling CallTransferCallbacks::NotifyTransferFinished(), so
OutboundCall::SetSent() is not called. As a result,
OutboundCall::SetCancelled() is never called for the OutboundCall.
Connection::ProcessOutboundTransfers() then calls
ReactorThread::DestroyConnection() to destroy the connection.
ReactorThread::DestroyConnection() calls Connection::Shutdown() to clear all
outbound calls that have been sent and are awaiting a response. But for an RPC
that is being cancelled, car->call has already been reset to a null pointer,
so OutboundCall::SetFailed() is not called for the OutboundCall object.
Since none of OutboundCall::SetFailed(), OutboundCall::SetCancelled() and
OutboundCall::SetTimedOut() is called for the OutboundCall object, it cannot
move from SENDING state to a finished state, hence the RPC callback function
is never called.
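The whole sequence can be replayed with a toy model to check that only the
combination "cancelled while SENDING" plus "socket write error" leaves the
callback unreached. This sketch follows the description in this comment, not
the KRPC sources: CallModel, ConnectionModel, RunScenario() and callback_ran
are hypothetical names, and the calls are made from one function rather than
from reactor tasks.

```cpp
#include <cassert>

// Hypothetical, heavily reduced model of a KRPC outbound call.
struct CallModel {
  bool sending = false;
  bool cancellation_requested = false;
  bool callback_ran = false;  // stands in for ExecCompleteCb() being invoked

  void SetSending() { sending = true; }
  void SetSent() {
    assert(sending);
    sending = false;
    if (cancellation_requested) SetCancelled();  // deferred cancellation
  }
  void SetCancelled() { callback_ran = true; }
  void SetTimedOut() { callback_ran = true; }
  void SetFailed() { callback_ran = true; }
  void Cancel() {
    cancellation_requested = true;
    if (!sending) SetCancelled();  // while SENDING, wait for SetSent()
  }
};

// Hypothetical model of the connection's bookkeeping for a single call.
struct ConnectionModel {
  CallModel* car_call = nullptr;  // stands in for car->call

  void CancelOutboundCall() { car_call = nullptr; }
  void HandleOutboundCallTimeout() {
    if (car_call != nullptr) car_call->SetTimedOut();
  }
  void Shutdown() {
    if (car_call != nullptr) car_call->SetFailed();
    car_call = nullptr;
  }
};

// Replays the sequence and reports whether the callback ever ran.
bool RunScenario(bool cancel_while_sending, bool write_succeeds) {
  CallModel call;
  ConnectionModel conn;
  conn.car_call = &call;

  call.SetSending();  // ProcessOutboundTransfers() starts the transfer
  if (cancel_while_sending) {
    conn.CancelOutboundCall();  // both steps of ReactorThread::CancelOutboundCall()
    call.Cancel();
  }
  if (write_succeeds) {
    call.SetSent();   // reached through NotifyTransferFinished()
  } else {
    conn.Shutdown();  // reached through DestroyConnection()
  }
  conn.HandleOutboundCallTimeout();  // the call's timer eventually fires
  return call.callback_ran;
}
```

In this model, RunScenario(true, false) is the only combination that returns
false. The other three each reach the callback through SetCancelled(),
SetFailed() or SetTimedOut().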
This is the root cause: the RPC callback function
Coordinator::BackendState::ExecCompleteCb() was not called in this case, so
Coordinator::BackendState::WaitOnExecLocked() waited indefinitely.
was (Author: wzhou):
In Coordinator::BackendState::Cancel(), if exec_rpc_sent_ is true and
exec_done_ is false (meaning Coordinator::BackendState::ExecAsync() has been
called, but the callback function Coordinator::BackendState::ExecCompleteCb()
has not), we call RpcController::Cancel() to cancel the Exec() RPC, then call
WaitOnExecLocked() to wait for the callback function
Coordinator::BackendState::ExecCompleteCb() to be called.
From the above log message, Coordinator::BackendState::Cancel() for the 4th
backend hung after calling WaitOnExecLocked(). That means the callback
function was not called.
Coordinator::BackendState::ExecCompleteCb() should be called whenever the RPC
finishes successfully, finishes with an error, is cancelled, or times out.
RpcController::Cancel() schedules a cancellation task on the reactor thread
pool. A reactor thread executes the task in
ReactorThread::CancelOutboundCall(), which calls
Connection::CancelOutboundCall() and OutboundCall::Cancel().
Connection::CancelOutboundCall() resets car->call to null so that
Connection::HandleOutboundCallTimeout() skips calling
OutboundCall::SetTimedOut(). OutboundCall::Cancel() does not call
OutboundCall::SetCancelled() if its state is SENDING. In that case
OutboundCall::SetCancelled() is not called until OutboundCall::SetSent() is
called. So if the RPC is cancelled while OutboundCall.state_ is SENDING,
OutboundCall::SetTimedOut() will not be called when the timeout of the
outbound call is handled in Connection::HandleOutboundCallTimeout(), and
OutboundCall::SetCancelled() will not be called if the notification that the
transfer finished (CallTransferCallbacks::NotifyTransferFinished()) is not
received after the RPC call is sent on the wire.
Coordinator::BackendState::ExecCompleteCb() will not be called if neither
OutboundCall::SetCancelled() nor OutboundCall::SetTimedOut() is called.
That means that if CallTransferCallbacks::NotifyTransferFinished() is not
called after the RPC call is sent on the wire,
Coordinator::BackendState::ExecCompleteCb() will not be called, which leads
Coordinator::BackendState::WaitOnExecLocked() to wait indefinitely.
Connection::ProcessOutboundTransfers() calls OutboundCall::SetSending() to set
OutboundCall::state_ to SENDING when it starts the transfer. It then calls
OutboundTransfer::SendBuffer() to send the data through the socket.
OutboundTransfer::SendBuffer() calls socket->Writev() to send the data. If
socket->Writev() returns an error, the function returns the error without
calling CallTransferCallbacks::NotifyTransferFinished(), so
OutboundCall::SetSent() is not called.
This means that if the socket write fails, OutboundCall.state_ stays in
SENDING state and OutboundCall::SetCancelled() is not called.
This is the case in which Coordinator::BackendState::WaitOnExecLocked() waits
indefinitely.
> Coordinator hang when cancelling a query
> ----------------------------------------
>
> Key: IMPALA-11263
> URL: https://issues.apache.org/jira/browse/IMPALA-11263
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Wenzhe Zhou
> Assignee: Wenzhe Zhou
> Priority: Major
>
> In a rare case, the callback function
> Coordinator::BackendState::ExecCompleteCb() was not called for the
> corresponding ExecQueryFInstances RPC. This caused the coordinator to wait
> indefinitely when cancelling the query.