[ 
https://issues.apache.org/jira/browse/IMPALA-9113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated IMPALA-9113:
---------------------------------
    Description: 
There is a race condition in the query coordination code that could cause 
queries to hang indefinitely in an un-cancellable state if an impalad crashes 
after the query has transitioned to the FINISHED state, but before all backends 
have completed.

The issue occurs if:
 * A query produces all results
 * A client issues a fetch request to read all of those results
 * The client fetch request fetches all available rows (i.e. eos is hit)
 * {{Coordinator::GetNext}} then calls 
{{SetNonErrorTerminalState(ExecState::RETURNED_RESULTS)}}, which eventually 
calls {{WaitForBackends()}}
 * {{WaitForBackends()}} will block until all backends have completed
 * One of the impalads running the query crashes, and thus never reports 
success for the query fragment it was running
 * The {{WaitForBackends()}} call will then block indefinitely
 * Any attempt to cancel the query fails because the original fetch request 
that drove the {{WaitForBackends()}} call still holds the 
{{ClientRequestState}} lock, which prevents any cancellation from 
occurring.

Implementing IMPALA-6984 should theoretically fix this because as soon as eos 
is hit, the coordinator will call {{CancelBackends()}} rather than 
{{WaitForBackends()}}. Another solution would be to add a timeout to 
{{WaitForBackends()}} so that it returns once the timeout is hit. This would 
force the fetch request to return 0 rows with {{hasMoreRows=true}} and unblock 
any cancellation threads.
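
A minimal sketch of the timeout idea, for illustration only. This is not the 
actual Impala code: {{BackendTracker}}, {{FetchResult}}, {{FetchAfterEos}} and 
the plain {{std::mutex}} standing in for the {{ClientRequestState}} lock are 
all hypothetical names, and the real coordinator tracks backends differently. 
It only shows how a bounded wait lets the fetch path give up, report 
{{hasMoreRows=true}}, and release the lock so a cancellation could get in.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the coordinator's backend bookkeeping.
class BackendTracker {
 public:
  explicit BackendTracker(int num_backends) : remaining_(num_backends) {
    assert(num_backends >= 0);
  }

  // Called when a backend reports that its fragment has completed.
  void MarkBackendDone() {
    std::lock_guard<std::mutex> l(lock_);
    if (remaining_ > 0) --remaining_;
    if (remaining_ == 0) done_cv_.notify_all();
  }

  // Bounded wait: returns true if every backend completed, or false if the
  // timeout elapsed first (e.g. because a backend crashed and never reported).
  bool WaitForBackends(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(lock_);
    return done_cv_.wait_for(l, timeout, [this] { return remaining_ == 0; });
  }

 private:
  std::mutex lock_;
  std::condition_variable done_cv_;
  int remaining_;
};

// Hypothetical result of a fetch issued after eos has been hit.
struct FetchResult {
  int num_rows;
  bool has_more_rows;
};

// Hypothetical fetch path. request_lock models the per-request lock that a
// cancellation would also need; it is held only for the duration of the call.
FetchResult FetchAfterEos(BackendTracker& tracker, std::mutex& request_lock,
                          std::chrono::milliseconds timeout) {
  std::lock_guard<std::mutex> l(request_lock);
  if (tracker.WaitForBackends(timeout)) {
    return {0, false};  // All backends finished; nothing more to fetch.
  }
  // Timed out: return 0 rows but ask the client to retry. The lock is
  // released on return, so a pending cancellation can acquire it.
  return {0, true};
}
```

With this shape, a client that keeps fetching would simply retry until the 
backends finish or the query is cancelled. The trade-off is extra fetch 
round trips while a slow but healthy backend is still finishing.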

  was:
There is a race condition in the query coordination code that could cause 
queries to hang indefinitely in an un-cancellable state if an impalad crashes 
after the query has transitioned to the FINISHED state, but before all backends 
have completed.

The issue occurs if:
 * A query produces all results
 * A client issues a fetch request to read all of those results
 * The client fetch request fetches all available rows (e.g. eos is hit)
 * {{Coordinator::GetNext}} then calls 
{{SetNonErrorTerminalState(ExecState::RETURNED_RESULTS)}} which eventually 
calls {{WaitForBackends()}}
 * {{WaitForBackends()}} will block until all backends have completed
 * One of the impalads running the query crashes, and thus never reports 
success for the query fragment it was running
 * The {{WaitForBackends()}} call will then block indefinitely
 * Any attempt to cancel the query fails because the original fetch request 
that drove the {{WaitForBackends()}} call has acquired the 
{{ClientRequestState}} lock, which thus prevents any cancellation from 
occurring.

Implementing IMPALA-6984 should theoretically fix because as soon as eos is 
hit, it would call {{CancelBackends()}} rather than {{WaitForBackends()}}. 
Another solution would be to add a timeout to the {{WaitForBackends()}} so that 
it returns after the timeout is hit, this would force the fetch request to 
return 0 rows with {{hasMoreRows=true}}, and unblock any cancellation threads.


> Queries can hang if an impalad is killed after a query has FINISHED
> -------------------------------------------------------------------
>
>                 Key: IMPALA-9113
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9113
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Clients
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> There is a race condition in the query coordination code that could cause 
> queries to hang indefinitely in an un-cancellable state if an impalad crashes 
> after the query has transitioned to the FINISHED state, but before all 
> backends have completed.
>
> The issue occurs if:
>  * A query produces all results
>  * A client issues a fetch request to read all of those results
>  * The client fetch request fetches all available rows (i.e. eos is hit)
>  * {{Coordinator::GetNext}} then calls 
> {{SetNonErrorTerminalState(ExecState::RETURNED_RESULTS)}}, which eventually 
> calls {{WaitForBackends()}}
>  * {{WaitForBackends()}} will block until all backends have completed
>  * One of the impalads running the query crashes, and thus never reports 
> success for the query fragment it was running
>  * The {{WaitForBackends()}} call will then block indefinitely
>  * Any attempt to cancel the query fails because the original fetch request 
> that drove the {{WaitForBackends()}} call still holds the 
> {{ClientRequestState}} lock, which prevents any cancellation from 
> occurring.
>
> Implementing IMPALA-6984 should theoretically fix this because as soon as eos 
> is hit, the coordinator will call {{CancelBackends()}} rather than 
> {{WaitForBackends()}}. Another solution would be to add a timeout to 
> {{WaitForBackends()}} so that it returns once the timeout is hit. This would 
> force the fetch request to return 0 rows with {{hasMoreRows=true}} and 
> unblock any cancellation threads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
