Hello Michael Ho, Philip Zeyliger, Todd Lipcon, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12299

to look at the new patch set (#7).

Change subject: IMPALA-2990: timeout unresponsive queries in coordinator
......................................................................

IMPALA-2990: timeout unresponsive queries in coordinator

The coordinator currently waits indefinitely if it does not receive a
status report from a backend. This can cause a query to hang in
certain situations, for example if the backend decides to cancel
itself after its status report RPCs fail.

This patch adds a thread to ImpalaServer which periodically iterates
over all queries for which that server is the coordinator and cancels
any that haven't had a report from a backend in a certain amount of
time.
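
A minimal sketch of one pass of such a check, for illustration only.
The types and names below (BackendReportState, CoordinatedQuery,
CancelUnresponsiveQueries) are hypothetical stand-ins and are not the
classes touched by this patch. The 'done' flag, which exempts backends
that have already finished, is also an assumption.

```cpp
#include <chrono>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-backend bookkeeping kept by the coordinator.
struct BackendReportState {
  std::string address;
  Clock::time_point last_report_time;  // when the last status report arrived
  bool done;  // assumption: a finished backend is not expected to report again
};

// Hypothetical stand-in for a query coordinated by this server.
struct CoordinatedQuery {
  std::string query_id;
  std::vector<BackendReportState> backends;
  bool cancelled = false;
};

// Returns true if any unfinished backend has been silent for longer than
// 'timeout' as of 'now'.
inline bool HasUnresponsiveBackend(const CoordinatedQuery& query,
                                   Clock::time_point now,
                                   std::chrono::milliseconds timeout) {
  for (const BackendReportState& backend : query.backends) {
    if (!backend.done && now - backend.last_report_time > timeout) return true;
  }
  return false;
}

// One pass of the periodic check: marks every query with an unresponsive
// backend as cancelled and returns how many queries were cancelled.
inline int CancelUnresponsiveQueries(std::vector<CoordinatedQuery>& queries,
                                     Clock::time_point now,
                                     std::chrono::milliseconds timeout) {
  int num_cancelled = 0;
  for (CoordinatedQuery& query : queries) {
    if (query.cancelled) continue;
    if (HasUnresponsiveBackend(query, now, timeout)) {
      query.cancelled = true;
      ++num_cancelled;
    }
  }
  return num_cancelled;
}
```

Passing 'now' in as a parameter, rather than reading the clock inside
the loop, keeps the pass deterministic and easy to exercise in a test.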

The timeout is calculated as the longest time a backend will keep
retrying a status report before giving up and cancelling itself.
With the default flags, this timeout is about 15 minutes.

The thread wakes up at an interval of the calculated timeout + 10%.
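
The arithmetic can be sketched as follows. The retry model
(ReportRetryPolicy and its three fields) is an assumption made for
illustration; the actual flags and formula used by the patch are not
shown in this message. Only the "timeout + 10%" wake-up interval comes
from the description above.

```cpp
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical retry policy for sending one status report. These fields are
// illustrative assumptions, not Impala flags.
struct ReportRetryPolicy {
  int max_attempts;                    // total send attempts before giving up
  milliseconds rpc_timeout;            // longest a single attempt can take
  milliseconds wait_between_attempts;  // pause between consecutive attempts
};

// Upper bound on how long a backend keeps trying to deliver one report
// under the assumed policy: every attempt runs to its timeout, with a
// pause between consecutive attempts.
inline milliseconds MaxReportRetryTime(const ReportRetryPolicy& policy) {
  if (policy.max_attempts <= 0) return milliseconds(0);
  return policy.max_attempts * policy.rpc_timeout +
         (policy.max_attempts - 1) * policy.wait_between_attempts;
}

// Wake-up interval of the checking thread: the timeout plus 10%.
inline milliseconds CheckInterval(milliseconds timeout) {
  return timeout + timeout / 10;
}
```

With a timeout of about 15 minutes, CheckInterval gives roughly 16.5
minutes between passes.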

TODO:
- Write functional tests once the appropriate mechanisms are in place
  to simulate errors (IMPALA-8138)

Testing:
Ran tests on a 10 node cluster loaded with tpch 500:
- Ran the stress test for 1000 queries with the debug actions:
  'REPORT_EXEC_STATUS_SEND:[email protected]|REPORT_EXEC_STATUS_RECV:[email protected]'
  Prior to this patch, this setup results in hanging queries. With
  this patch, no hangs were observed.
- Ran perf tests with 4 concurrent streams, 3 iterations per query.
  Found no change in performance.

Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/generate_error_codes.py
10 files changed, 155 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/12299/7
--
To view, visit http://gerrit.cloudera.org:8080/12299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987
Gerrit-Change-Number: 12299
Gerrit-PatchSet: 7
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Philip Zeyliger <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
