Bikramjeet Vig has uploaded this change for review.
( http://gerrit.cloudera.org:8080/17332 )


Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................

IMPALA-9155: Add recovery mechanism to admission service

Major changes:
- Leverages the admission heartbeat mechanism to signal the
coordinator to send its complete admission state
- No RPCs from a coordinator are serviced until it has sent its
complete admission state. This prevents admission decisions from
being made before the admission service has built its view of the
cluster
- The complete admission state consists of the states of all queries
that have successfully been admitted, that is, queries that have
received a valid schedule from the admission controller and have
marked their admission as complete (for remote admission, this means
the pending admit status has transitioned from true to false)
- This helps prevent sending incomplete/inconsistent state to the
admission controller
- Queries that have not started admission get a chance to send their
request to the new service
- Queries that are queued restart the admission process by sending
the request again. This retry is now also recorded in the query
profile
- Other RPCs like ReleaseBackend, ReleaseQuery and CancelQuery that
don't get serviced (until the initial admission state is sent) can
leave the admission service with inconsistent state. This state will
be rectified by the admission heartbeats
- AdmitQuery and GetQueryStatus simply retry if they notice a
network failure (assuming admissiond might be down or restarting) or
receive the error message that they cannot be serviced yet
(admissiond is waiting on the initial state from this coordinator)
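
The gating and retry behaviour described above can be sketched roughly
as follows. This is a minimal, hypothetical Python model and not the
code in this change: the names AdmissionServiceSketch, NotReadyError
and admit_with_retry are invented for illustration, and the real
heartbeat is a periodic RPC rather than a direct call.

```python
class NotReadyError(Exception):
    """Hypothetical error: the service still awaits this coordinator's state."""


class AdmissionServiceSketch:
    """Toy stand-in for a freshly restarted admission service."""

    def __init__(self):
        # coordinator id -> set of query ids known to be admitted
        self.state_by_coord = {}

    def heartbeat(self, coord_id, admitted_query_ids=None):
        """Return True if the coordinator is asked to send its full state."""
        if coord_id in self.state_by_coord:
            return False
        if admitted_query_ids is None:
            return True
        self.state_by_coord[coord_id] = set(admitted_query_ids)
        return False

    def admit_query(self, coord_id, query_id):
        # Refuse to service RPCs until the coordinator's state is known.
        if coord_id not in self.state_by_coord:
            raise NotReadyError(coord_id)
        self.state_by_coord[coord_id].add(query_id)
        return "ADMITTED"


def admit_with_retry(service, coord_id, query_id, admitted, max_attempts=5):
    """Retry admission on 'not ready' or network errors; count the retries."""
    retries = 0
    for _ in range(max_attempts):
        try:
            return service.admit_query(coord_id, query_id), retries
        except (NotReadyError, ConnectionError):
            retries += 1
            # The heartbeat response signals that full state is wanted.
            if service.heartbeat(coord_id):
                service.heartbeat(coord_id, admitted)
    raise RuntimeError("admission service unavailable")
```

In this sketch the first AdmitQuery attempt is rejected, the heartbeat
exchange delivers the coordinator's admitted queries, and the second
attempt succeeds with one retry recorded.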

Limitations:
- Rebuilding the state cannot ensure that queued queries will
maintain their spot in the queue.
- Queries can be admitted before all coordinators get a chance to
send their state. This can result in a brief period of over-admission.
We cannot rely completely on the statestore membership update and
wait for all coordinators there to send admission state because
that membership is also dynamic, which makes it difficult to decide
when to assume that the admission state is complete.
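
As a rough illustration of the over-admission window, consider a pool
with a limit on running queries. The function name, the limit and the
per-coordinator counts below are made up for the example and are not
taken from this change.

```python
def can_admit(known_running_by_coord, pool_limit):
    """Admit only if the queries the service knows about are under the limit."""
    return sum(known_running_by_coord.values()) < pool_limit


# Hypothetical cluster: five queries are actually running, limit is four.
pool_limit = 4
actual_running = {"coord-1": 3, "coord-2": 2}

# Right after a restart, only coord-1 has sent its admission state.
known = {"coord-1": actual_running["coord-1"]}
admitted_too_early = can_admit(known, pool_limit)

# Once coord-2 reports, the service sees the pool is already over the limit.
known["coord-2"] = actual_running["coord-2"]
admitted_after_full_state = can_admit(known, pool_limit)
```

Here the service would admit a query while it only knows about
coord-1, even though the pool is already full once coord-2 is counted.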

Testing:
- Added end-to-end tests

Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
---
M be/src/runtime/coordinator-backend-resource-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/scheduling/admission-control-client.cc
M be/src/scheduling/admission-control-client.h
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller-test.cc
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/local-admission-control-client.cc
M be/src/scheduling/local-admission-control-client.h
M be/src/scheduling/remote-admission-control-client.cc
M be/src/scheduling/remote-admission-control-client.h
M be/src/scheduling/schedule-state.cc
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M common/protobuf/admission_control_service.proto
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_admission_controller.py
23 files changed, 674 insertions(+), 65 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/17332/1
--
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <[email protected]>
