Hello Michael Ho, Thomas Marshall, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/12082
to look at the new patch set (#8).
Change subject: IMPALA-7931: fix executor shutdown races
......................................................................
IMPALA-7931: fix executor shutdown races
There were two races:
* queries were terminated when the statestore detected an impalad as
failed, even if the query had already finished executing on that
impalad.
* NUM_FRAGMENTS_IN_FLIGHT was used to detect the backend being
idle, but it was decremented before the final status report
was sent.
The fixes are:
* keep track of the backends that triggered the potential cancellation,
and only proceed with the cancellation if the coordinator still has
fragments executing on one of those backends.
* add a new metric that keeps track of the number of executing queries,
which isn't decremented until the final status report is sent.
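As a rough illustration of the two fixes above, here is a small Python
model. It is only a sketch under stated assumptions, not the code in this
patch: BackendMetrics, finish_query, should_cancel and the host names are
hypothetical, and the real logic lives in the C++ coordinator and
impala-server code.

```python
# Toy model of the two fixes. All names here are hypothetical and chosen
# for illustration; they do not mirror Impala's actual classes or metrics API.
from dataclasses import dataclass, field


@dataclass
class BackendMetrics:
    """Stand-in for the per-backend counters on an executor."""
    num_fragments_in_flight: int = 0
    num_queries_executing: int = 0  # models the new "executing queries" metric
    num_queries_executed: int = 0   # models the new "total queries" metric
    events: list = field(default_factory=list)

    def old_is_idle(self):
        # Old check: idle as soon as no fragments are in flight.
        return self.num_fragments_in_flight == 0

    def is_idle(self):
        # New check: idle only once no query is still executing or reporting.
        return self.num_queries_executing == 0

    def start_query(self, num_fragments):
        self.num_fragments_in_flight += num_fragments
        self.num_queries_executing += 1

    def finish_query(self, num_fragments):
        # Fragments finish first; the old check already reports "idle" here.
        self.num_fragments_in_flight -= num_fragments
        self.events.append(("fragments_done", self.old_is_idle(), self.is_idle()))
        # The final status report is sent while the query still counts as executing.
        self.events.append(("final_report_sent", self.old_is_idle(), self.is_idle()))
        # Only now is the query counted as finished on this backend.
        self.num_queries_executing -= 1
        self.num_queries_executed += 1


def should_cancel(failed_backends, backends_still_executing):
    """Cancel only if a failed backend still has fragments executing for the query."""
    return bool(set(failed_backends) & set(backends_still_executing))
```

In this model the old check would report the backend as idle before the
final report goes out, while the new counter keeps it busy until the
report is sent. Likewise, should_cancel(["host-2"], ["host-1"]) leaves the
query alone because the failed backend has nothing left to execute.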
Also do some cleanup/improvements in this code:
* use proper error codes for some errors
* more overloads for Status::Expected()
* also add a metric for the total number of queries executed on the
backend
Testing:
Added a new version of test_shutdown_executor with delays that
trigger both races. This test runs only in exhaustive mode to avoid
adding ~20s to core build time.
Ran exhaustive tests.
Looped test_restart_services overnight.
Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
---
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
A be/src/service/cancellation-work.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/generate_error_codes.py
M common/thrift/metrics.json
M tests/custom_cluster/test_restart_services.py
17 files changed, 498 insertions(+), 178 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/12082/8
--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
Gerrit-Change-Number: 12082
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>