Joe McDonnell created IMPALA-15115:
--------------------------------------

             Summary: Race condition where GetOperationStatus() may return 
error when query is retried
                 Key: IMPALA-15115
                 URL: https://issues.apache.org/jira/browse/IMPALA-15115
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
    Affects Versions: Impala 5.0.0
            Reporter: Joe McDonnell
            Assignee: Joe McDonnell


In HS2's GetOperationStatus() and Beeswax's get_state(), the code looks like 
this:
{code:java}
void ImpalaServer::GetOperationStatus(TGetOperationStatusResp& return_val,
    const TGetOperationStatusReq& request) {
...
  status = GetActiveQueryHandle(query_id, &query_handle); <---- #1
...
  // When using long polling, this waits up to long_polling_time_ms milliseconds for
  // query completion.
  query_handle->WaitForCompletionExecState(); <---- #2
...
  {
    lock_guard<mutex> l(*query_handle->lock());
    TOperationState::type operation_state = query_handle->TOperationState();
    return_val.__set_operationState(operation_state); <---- #3
    if (operation_state == TOperationState::ERROR_STATE) {
      DCHECK(!query_handle->query_status().ok());
      return_val.__set_errorMessage(Substitute(QUERY_ERROR_FORMAT,
          PrintId(query_id), query_handle->query_status().GetDetail()));
      return_val.__set_sqlState(SQLSTATE_GENERAL_ERROR);
    } else {
      ClientRequestState::RetryState retry_state = query_handle->retry_state();
      if (retry_state != ClientRequestState::RetryState::RETRYING
          && retry_state != ClientRequestState::RetryState::RETRIED) {
        DCHECK(query_handle->query_status().ok());
      }
    }
{code}
If we get the active query handle at #1 and the query then fails and is retried 
before #3, GetOperationStatus() returns an error even though the query is being 
retried. This happens deterministically with long polling, because the call 
waits for a significant time at #2 and is woken up when the query hits an error. 
However, it can also happen without long polling, and it may be causing 
flakiness in some of our retry tests.

When we have the lock, if the query hit an error, we should check to see if it 
was retried. If it was retried, we can call GetActiveQueryHandle() to get the 
new handle and then return its status.
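
To make the interleaving and the proposed check concrete, here is a minimal, self-contained sketch. It is not Impala code: Registry, QueryHandle, FailAndRetry(), StatusFromStaleHandle() and StatusFollowingRetry() are hypothetical stand-ins for ImpalaServer, ClientRequestState and GetActiveQueryHandle(), and it assumes the original handle ends up in an error state with a RETRIED retry state. Unlike the wording above, the sketch releases the stale handle's lock before looking up the new handle, simply to avoid holding two locks at once; whether that is appropriate in the real code is an open question.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins for Impala's types (assumptions).
enum class OpState { RUNNING_STATE, FINISHED_STATE, ERROR_STATE };
enum class RetryState { NOT_RETRIED, RETRYING, RETRIED };

struct QueryHandle {
  std::mutex lock;
  OpState op_state = OpState::RUNNING_STATE;
  RetryState retry_state = RetryState::NOT_RETRIED;
};

// Maps a client-visible query id to the handle currently active for it.
class Registry {
 public:
  std::shared_ptr<QueryHandle> GetActive(const std::string& id) {
    std::lock_guard<std::mutex> l(mu_);
    auto it = active_.find(id);
    return it == active_.end() ? nullptr : it->second;
  }
  void SetActive(const std::string& id, std::shared_ptr<QueryHandle> h) {
    std::lock_guard<std::mutex> l(mu_);
    active_[id] = std::move(h);
  }
  // Simulates a failure plus transparent retry: the old handle is marked
  // as errored and retried, and a fresh handle becomes the active one.
  void FailAndRetry(const std::string& id) {
    std::shared_ptr<QueryHandle> old_handle = GetActive(id);
    if (old_handle == nullptr) return;
    {
      std::lock_guard<std::mutex> l(old_handle->lock);
      old_handle->op_state = OpState::ERROR_STATE;
      old_handle->retry_state = RetryState::RETRIED;
    }
    SetActive(id, std::make_shared<QueryHandle>());
  }

 private:
  std::mutex mu_;
  std::map<std::string, std::shared_ptr<QueryHandle>> active_;
};

// Behavior described in this issue: report whatever the handle fetched at
// step #1 says, even if the query was retried in the meantime.
OpState StatusFromStaleHandle(QueryHandle& h) {
  std::lock_guard<std::mutex> l(h.lock);
  return h.op_state;
}

// Proposed behavior: on error, check whether the query was retried and, if
// so, report the state of the handle that is active now.
OpState StatusFollowingRetry(Registry& reg, const std::string& id,
                             const std::shared_ptr<QueryHandle>& stale) {
  {
    std::lock_guard<std::mutex> l(stale->lock);
    bool retried = stale->op_state == OpState::ERROR_STATE &&
                   stale->retry_state == RetryState::RETRIED;
    if (!retried) return stale->op_state;
  }
  // The stale handle's lock is released here, before the registry lookup.
  std::shared_ptr<QueryHandle> fresh = reg.GetActive(id);
  if (fresh == nullptr || fresh == stale) return OpState::ERROR_STATE;
  std::lock_guard<std::mutex> l(fresh->lock);
  return fresh->op_state;
}

// Replays the interleaving from the description in a single thread.
bool RunRaceDemo() {
  Registry reg;
  reg.SetActive("q1", std::make_shared<QueryHandle>());
  std::shared_ptr<QueryHandle> stale = reg.GetActive("q1");  // step #1
  reg.FailAndRetry("q1");  // happens while the caller waits at step #2
  assert(StatusFromStaleHandle(*stale) == OpState::ERROR_STATE);
  assert(StatusFollowingRetry(reg, "q1", stale) == OpState::RUNNING_STATE);
  return true;
}
```

Calling RunRaceDemo() replays the problematic ordering deterministically: the stale handle reports an error, while the retry-aware lookup reports the state of the new attempt. The sketch follows a single retry hop only and does not model the RETRYING state, where the new handle may not be registered yet.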

The same basic issue exists for Beeswax as well. This is a blocker for enabling 
long polling by default.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
