Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/23264 )

Change subject: IMPALA-14271: Reapply the core piece of IMPALA-6984
......................................................................

IMPALA-14271: Reapply the core piece of IMPALA-6984

IMPALA-6984 changed the behavior to cancel backends when the
query reaches the RETURNED_RESULTS state. This ran into a regression
on large clusters where a query would end up waiting 10 seconds.
IMPALA-10047 reverted the core piece of the change.

For tuple caching, we found that a scan node can get stuck waiting
for a global runtime filter. It turns out that the coordinator will
not send out global runtime filters if the query is in a terminal
state. Tuple caching was causing queries to reach the RETURNED_RESULTS
phase before the runtime filter could be sent out. Reenabling the core
part of IMPALA-6984 sends out a cancel as soon as the query transitions
to RETURNED_RESULTS and wakes up any fragment instances waiting on
runtime filters.

The underlying cause of IMPALA-10047 is a tangle of locks that causes
us to exhaust the RPC threads. The coordinator is holding a lock on the
backend state while it sends the cancel synchronously. Other backends
that complete during that time run
Coordinator::BackendState::LogFirstInProgress(), which iterates through
backend states to find the first that is not done. The check to see if
a backend state is done takes a lock on the backend state. The problem
case is that the coordinator may be sending a cancel to a backend on
itself. In that case, it needs an RPC thread on the coordinator to be
available to process the cancel. If all of the RPC threads are
processing updates, they can all call LogFirstInProgress() and get
stuck on the backend state lock for the coordinator's fragment. In
that case, it becomes a temporary deadlock as the cancel can't be
processed and the coordinator won't release the lock. It only gets
resolved by the RPC timing out.
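
The locking pattern involved can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from this change: BackendStateSketch, its members, and the rpc callback are hypothetical names standing in for a per-backend state object and a synchronous RPC. It shows a method that releases a mutex around a blocking call and reacquires it afterwards, so that the callee, or any other thread, can take the same lock while the call is in flight.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical stand-in for a per-backend state object; not Impala's class.
class BackendStateSketch {
 public:
  // Takes the lock, like the "is this backend done?" check described above.
  bool IsDone() {
    std::lock_guard<std::mutex> l(lock_);
    return done_;
  }

  void MarkDone() {
    std::lock_guard<std::mutex> l(lock_);
    done_ = true;
  }

  bool CancelAcked() {
    std::lock_guard<std::mutex> l(lock_);
    return cancel_acked_;
  }

  // Invokes `rpc` (a placeholder for a synchronous cancel RPC) without
  // holding lock_ during the call. Returns true if this call sent the cancel.
  bool Cancel(const std::function<void()>& rpc) {
    std::unique_lock<std::mutex> l(lock_);
    if (done_ || sent_cancel_) return false;
    // Record the decision under the lock so a concurrent caller that gets
    // in while the lock is dropped does not send a second cancel.
    sent_cancel_ = true;

    l.unlock();  // Drop the lock for the duration of the blocking call.
    rpc();       // Anything needing lock_ in here can make progress.
    l.lock();    // Reacquire before touching shared state again.

    assert(l.owns_lock());
    cancel_acked_ = true;
    return true;
  }

 private:
  std::mutex lock_;
  bool done_ = false;
  bool sent_cancel_ = false;
  bool cancel_acked_ = false;
};
```

If Cancel() kept lock_ held across rpc(), a callback that called IsDone() on the same object would block on a non-recursive mutex that its own caller holds, which is the single-threaded analogue of the RPC threads getting stuck behind the coordinator's backend state lock. State that must stay consistent across the call (here sent_cancel_) is updated before the lock is released.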

To resolve this, this changes the Cancel() method to drop the lock
while doing the CancelQueryFInstances RPC. It reacquires the lock
when it finishes the RPC.

Testing:
 - Hand tested with 10 impalads and control_service_num_svc_threads=1
   Without the fix, it reproduces easily after reverting IMPALA-10047.
   With the fix, it doesn't reproduce.

Change-Id: Ia058b03c72cc4bb83b0bd0a19ff6c8c43a647974
Reviewed-on: http://gerrit.cloudera.org:8080/23264
Reviewed-by: Yida Wu
Reviewed-by: Michael Smith
Tested-by: Impala Public Jenkins
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator.cc
2 files changed, 22 insertions(+), 1 deletion(-)

Approvals:
  Yida Wu: Looks good to me, but someone else must approve
  Michael Smith: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/23264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ia058b03c72cc4bb83b0bd0a19ff6c8c43a647974
Gerrit-Change-Number: 23264
Gerrit-PatchSet: 5
Gerrit-Owner: Joe McDonnell
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell
Gerrit-Reviewer: Kurt Deschler
Gerrit-Reviewer: Michael Smith
Gerrit-Reviewer: Yida Wu
