Joe McDonnell created IMPALA-15115:
--------------------------------------
Summary: Race condition where GetOperationStatus() may return
error when query is retried
Key: IMPALA-15115
URL: https://issues.apache.org/jira/browse/IMPALA-15115
Project: IMPALA
Issue Type: Bug
Components: Backend
Affects Versions: Impala 5.0.0
Reporter: Joe McDonnell
Assignee: Joe McDonnell
In HS2's GetOperationStatus() and Beeswax's get_state(), the code looks like
this:
{code:java}
void ImpalaServer::GetOperationStatus(TGetOperationStatusResp& return_val,
const TGetOperationStatusReq& request) {
...
status = GetActiveQueryHandle(query_id, &query_handle); <---- #1
...
// When using long polling, this waits up to long_polling_time_ms milliseconds for
// query completion.
query_handle->WaitForCompletionExecState(); <---- #2
...
{
lock_guard<mutex> l(*query_handle->lock());
TOperationState::type operation_state = query_handle->TOperationState();
return_val.__set_operationState(operation_state); <---- #3
if (operation_state == TOperationState::ERROR_STATE) {
DCHECK(!query_handle->query_status().ok());
return_val.__set_errorMessage(Substitute(QUERY_ERROR_FORMAT,
PrintId(query_id), query_handle->query_status().GetDetail()));
return_val.__set_sqlState(SQLSTATE_GENERAL_ERROR);
} else {
ClientRequestState::RetryState retry_state = query_handle->retry_state();
if (retry_state != ClientRequestState::RetryState::RETRYING
&& retry_state != ClientRequestState::RetryState::RETRIED) {
DCHECK(query_handle->query_status().ok());
}
}
{code}
If we get the active query handle at #1 and the query then fails and is retried
before #3, GetOperationStatus() returns an error even though the query is being
retried. With long polling this is deterministic, because the call waits for a
significant time at #2 and wakes up when the query hits an error. However, it
can also happen without long polling, and it may be causing flakiness in some
of our retry tests.
When we have the lock, if the query hit an error, we should check to see if it
was retried. If it was retried, we can call GetActiveQueryHandle() to get the
new handle and then return its status.
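A rough sketch of that check, as a standalone model rather than a patch against ImpalaServer. Handle, Registry, StatusResp and ResolveStatus are hypothetical stand-ins for ClientRequestState, the active-query lookup behind GetActiveQueryHandle(), and the Thrift response. The sketch only treats RETRIED as "was retried"; how RETRYING should be reported is left open here.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins for Impala's types; not the real API.
enum class OpState { RUNNING, FINISHED, ERROR };
enum class RetryState { NOT_RETRIED, RETRYING, RETRIED };

struct Handle {
  std::mutex lock;
  OpState state = OpState::RUNNING;
  RetryState retry_state = RetryState::NOT_RETRIED;
  std::string error;
};

struct StatusResp {
  OpState state;
  std::string error;
};

// Maps the client-visible query id to whichever handle is currently active.
// After a retry, the same id is assumed to resolve to the retried query.
class Registry {
 public:
  void SetActive(const std::string& id, std::shared_ptr<Handle> h) {
    std::lock_guard<std::mutex> l(mu_);
    active_[id] = std::move(h);
  }
  std::shared_ptr<Handle> GetActive(const std::string& id) {
    std::lock_guard<std::mutex> l(mu_);
    auto it = active_.find(id);
    return it == active_.end() ? nullptr : it->second;
  }

 private:
  std::mutex mu_;
  std::map<std::string, std::shared_ptr<Handle>> active_;
};

// `h` is the handle fetched at step #1, possibly stale by the time we get here.
StatusResp ResolveStatus(Registry& reg, const std::string& id,
                         std::shared_ptr<Handle> h) {
  {
    std::lock_guard<std::mutex> l(h->lock);
    bool failed_and_retried =
        h->state == OpState::ERROR && h->retry_state == RetryState::RETRIED;
    // Common case: no retry happened, so report this handle's state as-is.
    if (!failed_and_retried) return {h->state, h->error};
  }
  // The handle lock is released before the lookup so the two locks are
  // never held at the same time.
  std::shared_ptr<Handle> current = reg.GetActive(id);
  if (current == nullptr || current == h) {
    // No newer handle is available; fall back to the original status.
    std::lock_guard<std::mutex> l(h->lock);
    return {h->state, h->error};
  }
  // Report the status of the retried query instead of the original error.
  std::lock_guard<std::mutex> l(current->lock);
  return {current->state, current->error};
}
```

In this model, a caller that fetched the original handle before the retry gets the retried query's state back, while a query that failed without being retried still reports its error.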
The same basic issue exists for Beeswax as well. This is a blocker for enabling
long polling by default.