[
https://issues.apache.org/jira/browse/IMPALA-9113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sahil Takiar updated IMPALA-9113:
---------------------------------
Description:
There is a race condition in the query coordination code that could cause
queries to hang indefinitely in an un-cancellable state if an impalad crashes
after the query has transitioned to the FINISHED state, but before all backends
have completed.
The issue occurs if:
* A query produces all results
* A client issues a fetch request to read all of those results
* The client fetch request fetches all available rows (i.e. eos is hit)
* {{Coordinator::GetNext}} then calls
{{SetNonErrorTerminalState(ExecState::RETURNED_RESULTS)}} which eventually
calls {{WaitForBackends()}}
* {{WaitForBackends()}} will block until all backends have completed
* One of the impalads running the query crashes, and thus never reports
success for the query fragment it was running
* The {{WaitForBackends()}} call will then block indefinitely
* Any attempt to cancel the query fails because the original fetch request
that drove the {{WaitForBackends()}} call still holds the
{{ClientRequestState}} lock, which prevents any cancellation from
proceeding.
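The last two steps can be reproduced in isolation. The following is only a minimal sketch of the locking pattern, not Impala code: {{request_lock}} stands in for the {{ClientRequestState}} lock, the never-fulfilled future stands in for the missing backend report, and the function name and the 50 ms cancel timeout are hypothetical and exist only so the demo terminates.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Returns true if a cancel attempt managed to take the lock while the fetch
// thread was still blocked waiting for backends. All names are hypothetical.
bool CancelSucceedsWhileFetchBlocked() {
  std::timed_mutex request_lock;      // stand-in for the ClientRequestState lock
  std::promise<void> lock_held;       // signals that the fetch thread owns the lock
  std::promise<void> backend_report;  // the report a crashed impalad never sends
  std::future<void> backend_done = backend_report.get_future();

  std::thread fetch([&] {
    std::lock_guard<std::timed_mutex> l(request_lock);
    lock_held.set_value();
    backend_done.wait();              // models WaitForBackends() without a timeout
  });

  // Make sure the fetch thread really holds the lock before trying to cancel.
  lock_held.get_future().wait();

  // Cancellation path: needs the same lock. A real cancel would block forever;
  // here it gives up after 50 ms so the sketch can finish.
  bool cancelled = request_lock.try_lock_for(std::chrono::milliseconds(50));
  if (cancelled) request_lock.unlock();

  backend_report.set_value();         // release the fetch thread for cleanup only
  fetch.join();
  return cancelled;
}
```

In this sketch the cancel attempt cannot acquire the lock for as long as the fetch thread is parked in its wait, which mirrors the un-cancellable state described above.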
Implementing IMPALA-6984 should theoretically fix this because as soon as eos
is hit, the coordinator will call {{CancelBackends()}} rather than
{{WaitForBackends()}}. Another solution would be to add a timeout to
{{WaitForBackends()}} so that it returns once the timeout is hit. This would
force the fetch request to return 0 rows with {{hasMoreRows=true}} and unblock
any cancellation threads.
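As a rough illustration of the timeout idea, here is a minimal sketch using a standard condition variable. {{BackendTracker}}, {{BackendDone()}} and the {{timeout}} parameter are hypothetical names for this example; the real coordinator tracks backends differently, so this shows the shape of a bounded wait rather than a proposed patch.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the coordinator's backend bookkeeping.
class BackendTracker {
 public:
  explicit BackendTracker(int num_backends) : remaining_(num_backends) {}

  // Called when a backend reports that its fragment has completed.
  void BackendDone() {
    std::lock_guard<std::mutex> l(mu_);
    assert(remaining_ > 0);
    if (--remaining_ == 0) cv_.notify_all();
  }

  // Waits until every backend has reported or `timeout` elapses.
  // Returns true if all backends completed, false if the wait timed out.
  bool WaitForBackends(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(mu_);
    return cv_.wait_for(l, timeout, [this] { return remaining_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int remaining_;  // backends that have not reported yet
};
```

On a {{false}} return the caller could release the {{ClientRequestState}} lock and answer the fetch with 0 rows and {{hasMoreRows=true}}, which gives a pending cancellation the chance to run before the client retries.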
was:
There is a race condition in the query coordination code that could cause
queries to hang indefinitely in an un-cancellable state if an impalad crashes
after the query has transitioned to the FINISHED state, but before all backends
have completed.
The issue occurs if:
* A query produces all results
* A client issues a fetch request to read all of those results
* The client fetch request fetches all available rows (e.g. eos is hit)
* {{Coordinator::GetNext}} then calls
{{SetNonErrorTerminalState(ExecState::RETURNED_RESULTS)}} which eventually
calls {{WaitForBackends()}}
* {{WaitForBackends()}} will block until all backends have completed
* One of the impalads running the query crashes, and thus never reports
success for the query fragment it was running
* The {{WaitForBackends()}} call will then block indefinitely
* Any attempt to cancel the query fails because the original fetch request
that drove the {{WaitForBackends()}} call has acquired the
{{ClientRequestState}} lock, which thus prevents any cancellation from
occurring.
Implementing IMPALA-6984 should theoretically fix because as soon as eos is
hit, it would call {{CancelBackends()}} rather than {{WaitForBackends()}}.
Another solution would be to add a timeout to the {{WaitForBackends()}} so that
it returns after the timeout is hit, this would force the fetch request to
return 0 rows with {{hasMoreRows=true}}, and unblock any cancellation threads.
> Queries can hang if an impalad is killed after a query has FINISHED
> -------------------------------------------------------------------
>
> Key: IMPALA-9113
> URL: https://issues.apache.org/jira/browse/IMPALA-9113
> Project: IMPALA
> Issue Type: Bug
> Components: Backend, Clients
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major