Hello Andrew Sherman, Joe McDonnell, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/17332
to look at the new patch set (#2).
Change subject: IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission
service and fix consistency between coord failure detection and registration
......................................................................
IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission service
and fix consistency between coord failure detection and registration
Major changes:
IMPALA-9976:
- Leverages the admission heartbeat mechanism to signal the
coordinator to send its complete admission state
- No RPCs from a coordinator are serviced until it has sent its complete
admission state. This prevents admission decisions from being made before
the admission service has built its view of the cluster
- The complete admission state consists of the states of all queries
that have successfully been admitted, that is, queries that have received
a valid schedule from the admission controller and have marked their
admission as complete (for remote admission, this means the pending admit
status has transitioned from true to false)
- This helps prevent sending incomplete/inconsistent state to the
admission controller
- Queries that have not started admission get a chance to send their
request to the new service
- Queries that are queued restart the admission process by sending
the request again. This retry is now also marked in the query profile
- Other RPCs such as ReleaseBackend, ReleaseQuery and CancelQuery that
are not serviced (until the initial admission state is sent) can leave
the state inconsistent. The admission heartbeats will rectify this
state
- AdmitQuery and GetQueryStatus simply retry if they notice a network
failure (assuming admissiond might be down or restarting) or receive
the error message that they cannot be serviced yet (admissiond is
waiting on the initial state from this coordinator)
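
The recovery handshake described above can be illustrated with a minimal
Python sketch. It is not the code in this patch: AdmissionService,
NotReadyError, admit_with_retry and the "send_full_state" reply field are
hypothetical names, and the real service communicates over RPCs with
periodic heartbeats rather than direct method calls.

```python
class NotReadyError(Exception):
    """Raised while the service still waits for a coordinator's full state."""


class AdmissionService:
    """Toy stand-in for admissiond; every name here is hypothetical."""

    def __init__(self):
        self.admitted = {}   # coord_id -> set of admitted query ids
        self.synced = set()  # coordinators whose full state has arrived

    def heartbeat(self, coord_id):
        # The heartbeat reply signals whether the coordinator must send its
        # complete admission state.
        return {"send_full_state": coord_id not in self.synced}

    def receive_full_state(self, coord_id, query_ids):
        # Replace whatever was known about this coordinator in one step.
        self.admitted[coord_id] = set(query_ids)
        self.synced.add(coord_id)

    def admit_query(self, coord_id, query_id):
        # No request from a coordinator is serviced before its state arrived.
        if coord_id not in self.synced:
            raise NotReadyError(coord_id)
        self.admitted[coord_id].add(query_id)
        return "ADMITTED"


def admit_with_retry(service, coord_id, query_id, local_admitted, max_attempts=3):
    """Retry admission on 'not ready' or network errors, syncing in between."""
    for attempt in range(1, max_attempts + 1):
        try:
            return service.admit_query(coord_id, query_id), attempt
        except (NotReadyError, ConnectionError):
            # Assumption: the next heartbeat happens before the next attempt.
            if service.heartbeat(coord_id)["send_full_state"]:
                service.receive_full_state(coord_id, local_admitted)
    raise RuntimeError("admission service unavailable")


# A freshly restarted service knows nothing about coord-1 yet.
service = AdmissionService()
status, attempts = admit_with_retry(service, "coord-1", "q3", ["q1", "q2"])
```

In this sketch the first attempt is rejected, the heartbeat triggers the
full-state upload, and the second attempt is admitted.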
IMPALA-10866:
- Made sure that admission state removal on failure detection and
admission state rebuilding on coordinator registration are atomic
operations.
- Leverages the statestore's membership view to detect failures and
allow coordinator registration.
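
One way to picture the atomicity requirement is the Python sketch below,
which guards both removal and rebuilding with a single lock and lets a
membership snapshot decide who may register. It is an illustration under
assumptions, not the patch's implementation: CoordinatorRegistry,
on_membership_update, rebuild and snapshot are hypothetical names.

```python
import threading


class CoordinatorRegistry:
    """Hypothetical holder of per-coordinator admission state."""

    def __init__(self):
        self._lock = threading.Lock()
        # coord_id -> set of admitted query ids, or None while the
        # coordinator is registered but has not sent its state yet.
        self._state = {}

    def on_membership_update(self, live_coords):
        """Apply one membership snapshot atomically."""
        live = set(live_coords)
        with self._lock:
            for coord_id in list(self._state):
                if coord_id not in live:
                    # Failure detected: drop all state for this coordinator.
                    del self._state[coord_id]
            for coord_id in live:
                # Newly seen coordinators are registered, awaiting state.
                self._state.setdefault(coord_id, None)

    def rebuild(self, coord_id, query_ids):
        """Install a coordinator's complete state; reject unknown senders."""
        with self._lock:
            if coord_id not in self._state:
                return False
            self._state[coord_id] = set(query_ids)
            return True

    def snapshot(self):
        with self._lock:
            return {c: None if q is None else set(q)
                    for c, q in self._state.items()}
```

Because both paths take the same lock, a rebuild cannot interleave with the
removal of the same coordinator's state, and a coordinator absent from the
latest membership snapshot cannot register.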
Limitations:
- Rebuilding the state cannot ensure that queued queries will
maintain their spot in the queue.
- Queries can be admitted before all coordinators get a chance to
send their state. This can result in a brief period of over-admission.
We cannot rely completely on the statestore membership update and
wait for all coordinators listed there to send their admission state,
because that membership is itself dynamic, which makes it difficult
to decide when the admission state can be assumed complete.
- The functionalities for coordinator failure detection and
registration rely completely on the statestore.
Testing:
- Added end-to-end tests
Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
---
M be/src/runtime/coordinator-backend-resource-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/scheduling/admission-control-client.cc
M be/src/scheduling/admission-control-client.h
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller-test.cc
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/admissiond-env.cc
M be/src/scheduling/local-admission-control-client.cc
M be/src/scheduling/local-admission-control-client.h
M be/src/scheduling/remote-admission-control-client.cc
M be/src/scheduling/remote-admission-control-client.h
M be/src/scheduling/schedule-state.cc
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M common/protobuf/admission_control_service.proto
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_admission_controller.py
24 files changed, 756 insertions(+), 93 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/17332/2
--
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>