[
https://issues.apache.org/jira/browse/IMPALA-11263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526742#comment-17526742
]
Wenzhe Zhou edited comment on IMPALA-11263 at 4/23/22 7:38 AM:
---------------------------------------------------------------
In Coordinator::BackendState::Cancel(), if exec_rpc_sent_ is true and
exec_done_ is false (that is, Coordinator::BackendState::ExecAsync() has been
called but the callback Coordinator::BackendState::ExecCompleteCb() has not
yet run), we call RpcController::Cancel() to cancel the Exec() RPC and then
call WaitOnExecLocked() to wait for
Coordinator::BackendState::ExecCompleteCb() to be called.
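
The decision described above can be sketched as a small model. This is an
illustration only, not the Impala source: BackendStateModel, its public
fields, rpc_cancel_requested and DemoCancelInFlight() are hypothetical names,
and the blocking wait in WaitOnExecLocked() is reduced to a boolean result.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for Coordinator::BackendState.
// Only the two flags named in the comment are modeled.
struct BackendStateModel {
  bool exec_rpc_sent = false;         // models exec_rpc_sent_
  bool exec_done = false;             // models exec_done_
  bool rpc_cancel_requested = false;  // set where RpcController::Cancel() would run

  void ExecAsync() { exec_rpc_sent = true; }  // the Exec() RPC has been issued
  void ExecCompleteCb() { exec_done = true; } // the RPC layer reported completion

  // Returns true if Cancel() would block in WaitOnExecLocked() until
  // ExecCompleteCb() runs, and false if there is nothing to wait for.
  bool Cancel() {
    if (exec_rpc_sent && !exec_done) {
      rpc_cancel_requested = true;
      return true;
    }
    return false;
  }
};

// An RPC that is in flight forces Cancel() to wait for the callback.
inline void DemoCancelInFlight() {
  BackendStateModel state;
  state.ExecAsync();
  assert(state.Cancel());
  assert(state.rpc_cancel_requested);
}
```

In this model the wait ends only when ExecCompleteCb() runs, so any path in
the RPC layer that never invokes the callback leaves Cancel() blocked.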
From the above log message, Coordinator::BackendState::Cancel() for the 4th
backend hung after calling WaitOnExecLocked(). That means the callback
function was never called.
Coordinator::BackendState::ExecCompleteCb() should be called whenever the RPC
finishes successfully, finishes with an error, is cancelled, or times out.
RpcController::Cancel() schedules a cancellation task on the reactor thread
pool. A reactor thread executes the task in
ReactorThread::CancelOutboundCall(), which calls
Connection::CancelOutboundCall() and OutboundCall::Cancel().
Connection::CancelOutboundCall() resets car->call to null so that
Connection::HandleOutboundCallTimeout() skips the call to
OutboundCall::SetTimedOut(). OutboundCall::Cancel() does not call
OutboundCall::SetCancelled() if the call's state is SENDING.
In that case OutboundCall::SetCancelled() is not called until
OutboundCall::SetSent() is called. So if the RPC is cancelled while
OutboundCall::state_ is SENDING, OutboundCall::SetTimedOut() will not be
called when the outbound call's timeout is handled in
Connection::HandleOutboundCallTimeout(), and OutboundCall::SetCancelled()
will not be called if the transfer-finished notification
(CallTransferCallbacks::NotifyTransferFinished()) is never received after the
RPC call is sent on the wire.
Coordinator::BackendState::ExecCompleteCb() will not be called if neither
OutboundCall::SetCancelled() nor OutboundCall::SetTimedOut() is called.
So if CallTransferCallbacks::NotifyTransferFinished() is not called after the
RPC call is sent on the wire, Coordinator::BackendState::ExecCompleteCb() is
never called, which causes Coordinator::BackendState::WaitOnExecLocked() to
wait indefinitely.
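
The interaction between cancellation and the timeout handler can be modeled
as a small state machine. The sketch below follows the description in this
comment rather than the Kudu RPC source: OutboundCallModel, ConnectionModel,
CallState, car_call and DemoCancelWhileSending() are hypothetical names, the
state set is reduced to the states discussed here, and the completion
callback is represented by a flag.

```cpp
#include <cassert>

// Simplified subset of the call states; names are illustrative.
enum class CallState { kOnOutboundQueue, kSending, kSent, kTimedOut, kCancelled };

// Hypothetical model of OutboundCall, following the description above.
class OutboundCallModel {
 public:
  void SetSending() { state_ = CallState::kSending; }

  // A cancellation deferred while SENDING is delivered once the transfer
  // is known to have finished.
  void SetSent() {
    state_ = CallState::kSent;
    if (cancellation_requested_) SetCancelled();
  }

  // Cancellation is deferred while the call is SENDING.
  void Cancel() {
    cancellation_requested_ = true;
    if (state_ == CallState::kSending) return;
    SetCancelled();
  }

  void SetTimedOut() { Finish(CallState::kTimedOut); }
  void SetCancelled() { Finish(CallState::kCancelled); }

  CallState state() const { return state_; }
  // True once the completion callback (ExecCompleteCb() on the coordinator
  // side) would have been invoked.
  bool callback_invoked() const { return callback_invoked_; }

 private:
  void Finish(CallState final_state) {
    if (callback_invoked_) return;  // a call completes at most once
    state_ = final_state;
    callback_invoked_ = true;
  }

  CallState state_ = CallState::kOnOutboundQueue;
  bool cancellation_requested_ = false;
  bool callback_invoked_ = false;
};

// Hypothetical model of the connection's bookkeeping for one call.
struct ConnectionModel {
  OutboundCallModel* car_call = nullptr;  // stands in for car->call

  void CancelOutboundCall(OutboundCallModel* call) {
    car_call = nullptr;  // the timeout handler will now skip SetTimedOut()
    call->Cancel();
  }

  void HandleOutboundCallTimeout() {
    if (car_call == nullptr) return;
    car_call->SetTimedOut();
  }
};

// Cancel while SENDING, then let the timeout fire: nothing completes the
// call until SetSent() is finally reached.
inline void DemoCancelWhileSending() {
  OutboundCallModel call;
  ConnectionModel conn;
  conn.car_call = &call;
  call.SetSending();
  conn.CancelOutboundCall(&call);
  conn.HandleOutboundCallTimeout();
  assert(!call.callback_invoked());
  call.SetSent();
  assert(call.callback_invoked());
}
```

In this model, a call cancelled while SENDING has only one remaining route
to completion, which is SetSent(). If SetSent() is never reached, the
callback never runs.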
Connection::ProcessOutboundTransfers() calls OutboundCall::SetSending() to set
OutboundCall::state_ to SENDING when it starts the transfer. It then calls
OutboundTransfer::SendBuffer() to send the data through the socket.
OutboundTransfer::SendBuffer() calls socket->Writev() to send the data. If
socket->Writev() returns an error, SendBuffer() returns the error without
calling CallTransferCallbacks::NotifyTransferFinished(), so
OutboundCall::SetSent() is not called.
This means that if the socket write fails, OutboundCall::state_ stays in
SENDING and OutboundCall::SetCancelled() is not called.
This is the case in which Coordinator::BackendState::WaitOnExecLocked() waits
indefinitely.
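
The failing write path can be sketched in the same spirit. This is a
simplified model under stated assumptions, not the Kudu implementation:
OutboundTransferModel, SendState, WriteFn and DemoFailedWrite() are
hypothetical names, write_fn stands in for socket->Writev(), and partial
writes are ignored so that one successful write finishes the transfer.

```cpp
#include <cassert>
#include <functional>
#include <string>

enum class SendState { kOnOutboundQueue, kSending, kSent };

// Stands in for socket->Writev(); returns false on a write error.
using WriteFn = std::function<bool(const std::string&)>;

// Hypothetical model of one outbound transfer and the call it carries.
struct OutboundTransferModel {
  SendState call_state = SendState::kOnOutboundQueue;
  bool transfer_finished_notified = false;

  // Models CallTransferCallbacks::NotifyTransferFinished(), which is the
  // step that leads to OutboundCall::SetSent().
  void NotifyTransferFinished() {
    transfer_finished_notified = true;
    call_state = SendState::kSent;
  }

  // Models OutboundTransfer::SendBuffer(): on a write error it returns
  // early, so the transfer-finished notification is never delivered.
  bool SendBuffer(const WriteFn& write_fn, const std::string& payload) {
    if (!write_fn(payload)) return false;
    NotifyTransferFinished();
    return true;
  }

  // Models Connection::ProcessOutboundTransfers(): mark the call as
  // SENDING, then try to push the bytes through the socket.
  bool ProcessOutboundTransfers(const WriteFn& write_fn,
                                const std::string& payload) {
    call_state = SendState::kSending;
    return SendBuffer(write_fn, payload);
  }
};

// A failed write leaves the call in SENDING with no notification.
inline void DemoFailedWrite() {
  OutboundTransferModel transfer;
  bool ok = transfer.ProcessOutboundTransfers(
      [](const std::string&) { return false; }, "exec request");
  assert(!ok);
  assert(transfer.call_state == SendState::kSending);
  assert(!transfer.transfer_finished_notified);
}
```

With a write function that fails, the model ends in kSending, which is the
state in which a deferred cancellation is never delivered.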
> Coordinator hang when cancelling a query
> ----------------------------------------
>
> Key: IMPALA-11263
> URL: https://issues.apache.org/jira/browse/IMPALA-11263
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Wenzhe Zhou
> Assignee: Wenzhe Zhou
> Priority: Major
>
> In a rare case, the callback function
> Coordinator::BackendState::ExecCompleteCb() was not called for the
> corresponding ExecQueryFInstances RPC. This caused the coordinator to wait
> indefinitely when cancelling the query.