Hello Michael Ho, Philip Zeyliger, Todd Lipcon, Impala Public Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/12299 to look at the new patch set (#10). Change subject: IMPALA-2990: timeout unresponsive queries in coordinator ...................................................................... IMPALA-2990: timeout unresponsive queries in coordinator The coordinator currently waits indefinitely if it does not receive a status report from a backend. This could cause a query to hang indefinitely in certain situations, for example if the backend decides to cancel itself as a result of failed status report rpcs. This patch adds a thread to ImpalaServer which periodically iterates over all queries for which that server is the coordinator and cancels any that haven't had a report from a backend in a certain amount of time. This patch adds two flags: --status_report_max_retry_s: the maximum number of seconds a backend will attempt to send status reports before giving up. This is used in place of --status_report_max_retries which is now deprecated. --status_report_cancellation_padding: the coordinator will wait --status_report_max_retry_s * (1 + --status_report_cancellation_padding / 100) before concluding a backend is not responding and cancelling the query. Testing: - Added a functional test that runs a query that is cancelled through the new mechanism. - Passed a full set of exhaustive tests. Ran tests on a 10 node cluster loaded with tpch 500: - Ran the stress test for 1000 queries with the debug actions: 'REPORT_EXEC_STATUS_DELAY:JITTER@1000' Prior to this patch, this setup results in hanging queries. With this patch, no hangs were observed. - Ran perf tests with 4 concurrent streams, 3 iterations per query. Found no change in performance. Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987 --- M be/src/common/global-flags.cc M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/query-state.cc M be/src/runtime/query-state.h M be/src/service/impala-server.cc M be/src/service/impala-server.h M common/thrift/ImpalaInternalService.thrift M common/thrift/generate_error_codes.py M tests/custom_cluster/test_rpc_timeout.py 12 files changed, 190 insertions(+), 46 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/12299/10 -- To view, visit http://gerrit.cloudera.org:8080/12299 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987 Gerrit-Change-Number: 12299 Gerrit-PatchSet: 10 Gerrit-Owner: Thomas Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org>