Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15154

to look at the new patch set (#3).

Change subject: IMPALA-8712: Make ExecQueryFInstances async
......................................................................

IMPALA-8712: Make ExecQueryFInstances async

This patch refactors the ExecQueryFInstances rpc to be asychronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized in serialize rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 pertitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, change in query startup time was
neglible.
============================

TODO: once IMPALA-9335 (krpc rebase) goes in, the change in
be/src/kudu/rpc/connection.cc can be removed from this patch.

Testing:
- Added a e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
---
M be/src/common/global-flags.cc
M be/src/kudu/rpc/connection.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M be/src/runtime/query-state.cc
M be/src/util/counting-barrier.h
M tests/custom_cluster/test_rpc_exception.py
M tests/failure/test_failpoints.py
12 files changed, 442 insertions(+), 267 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/3
--
To view, visit http://gerrit.cloudera.org:8080/15154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Gerrit-Change-Number: 15154
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>

Reply via email to