Hello Michael Ho, Philip Zeyliger, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12299

to look at the new patch set (#8).

Change subject: IMPALA-2990: timeout unresponsive queries in coordinator
......................................................................

IMPALA-2990: timeout unresponsive queries in coordinator

The coordinator currently waits indefinitely if it does not receive a
status report from a backend. This could cause a query to hang
indefinitely in certain situations, for example if the backend decides
to cancel itself as a result of failed status report rpcs.

This patch adds a thread to ImpalaServer which periodically iterates
over all queries for which that server is the coordinator and cancels
any that haven't had a report from a backend in a certain amount of
time.

This patch adds two flags:
--status_report_max_retry_s: the maximum number of seconds a backend
  will attempt to send status reports before giving up. This is used
  in place of --status_report_max_retries which is now deprecated.
--status_report_cancellation_padding: the coordinator will wait
    --status_report_max_retry_s *
      (1 + --status_report_cancellation_padding / 100)
  before concluding a backend is not responding and cancelling the
  query.

Testing:
- Added a functional test that runs a query that is cancelled through
  the new mechanism.
Ran tests on a 10 node cluster loaded with tpch 500:
- Ran the stress test for 1000 queries with the debug actions:
  'REPORT_EXEC_STATUS_SEND:FAIL@0.1|REPORT_EXEC_STATUS_RECV:FAIL@0.1'
  Prior to this patch, this setup results in hanging queries. With
  this patch, no hangs were observed.
- Ran perf tests with 4 concurrent streams, 3 iterations per query.
  Found no change in performance.

Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_rpc_timeout.py
12 files changed, 179 insertions(+), 46 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/12299/8
--
To view, visit http://gerrit.cloudera.org:8080/12299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987
Gerrit-Change-Number: 12299
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

Reply via email to