Bikramjeet Vig has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................

IMPALA-9155: Add recovery mechanism to admission service

Major changes:
- Leverages the admission heartbeat mechanism to signal the coordinator to send its complete admission state.
- No RPCs from a coordinator are serviced until it has sent its complete admission state. This prevents admission decisions from being made before the admission service has built its view of the cluster.
- The complete admission state consists of the states of all queries that have successfully been admitted, that is, queries that have received a valid schedule from the admission controller and have marked their admission as complete (for remote admission, this means the pending admit status has transitioned from true to false).
- Restricting the state to fully admitted queries helps prevent sending incomplete or inconsistent state to the admission controller.
- Queries that have not started admission get a chance to send their request to the new service.
- Queries that are queued restart the admission process by sending the request again. This retry is now also marked in the query profile.
- Other RPCs such as ReleaseBackend, ReleaseQuery, and CancelQuery that are not serviced (until the initial admission state is sent) can result in inconsistent state. This state is rectified by the admission heartbeats.
- AdmitQuery and GetQueryStatus simply retry if they notice a network failure (assuming admissiond might be down or restarting) or receive the error message that they cannot be serviced yet (admissiond is still waiting on the initial state from this coordinator).

Limitations:
- Rebuilding the state cannot ensure that queued queries maintain their spot in the queue.
- Queries can be admitted before all coordinators get a chance to send their state. This can result in a brief period of over-admission.

We cannot rely completely on the statestore membership update and wait for all coordinators listed there to send their admission state, because that membership is also dynamic, which makes it difficult to decide when to assume that the admission state is complete.

Testing:
- Added end-to-end tests

Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
---
M be/src/runtime/coordinator-backend-resource-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/scheduling/admission-control-client.cc
M be/src/scheduling/admission-control-client.h
M be/src/scheduling/admission-control-service.cc
M be/src/scheduling/admission-control-service.h
M be/src/scheduling/admission-controller-test.cc
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/scheduling/local-admission-control-client.cc
M be/src/scheduling/local-admission-control-client.h
M be/src/scheduling/remote-admission-control-client.cc
M be/src/scheduling/remote-admission-control-client.h
M be/src/scheduling/schedule-state.cc
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M common/protobuf/admission_control_service.proto
M common/thrift/generate_error_codes.py
M tests/custom_cluster/test_admission_controller.py
23 files changed, 674 insertions(+), 65 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/32/17332/1
--
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <[email protected]>
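
For readers skimming the change, the recovery handshake described under "Major changes" can be pictured with a toy Python simulation. This is only a sketch under stated assumptions: `ToyAdmissionService`, `ToyCoordinator`, `NotReadyError`, and every method name below are hypothetical and are not taken from the Impala code base. It also simplifies timing by sending a heartbeat from inside the retry loop, whereas the change describes heartbeats as a separate mechanism.

```python
class NotReadyError(Exception):
    """Hypothetical error: the service still awaits this coordinator's initial state."""


class ToyAdmissionService:
    """Toy stand-in for a freshly restarted admissiond (illustrative names only)."""

    def __init__(self):
        # coordinator id -> set of admitted query ids, rebuilt after a restart
        self.state = {}

    def heartbeat(self, coord_id, query_ids, is_full_state):
        """Return True if the coordinator must resend its complete admission state."""
        if coord_id not in self.state and not is_full_state:
            return True
        # A full-state message seeds the view; later heartbeats reconcile drift.
        self.state[coord_id] = set(query_ids)
        return False

    def admit_query(self, coord_id, query_id):
        # No RPC is serviced until this coordinator has sent its full state.
        if coord_id not in self.state:
            raise NotReadyError(coord_id)
        self.state[coord_id].add(query_id)
        return "schedule-for-" + query_id


class ToyCoordinator:
    def __init__(self, coord_id, admitted):
        self.coord_id = coord_id
        self.admitted = set(admitted)  # only fully admitted queries are reported
        self.retries = 0

    def send_heartbeat(self, service):
        need_full = service.heartbeat(self.coord_id, self.admitted, False)
        if need_full:
            service.heartbeat(self.coord_id, self.admitted, True)

    def admit_with_retry(self, service, query_id, max_attempts=3):
        for _ in range(max_attempts):
            try:
                schedule = service.admit_query(self.coord_id, query_id)
                self.admitted.add(query_id)
                return schedule
            except NotReadyError:
                # Simplification: trigger the heartbeat here instead of waiting
                # for a periodic one, then retry the admission request.
                self.retries += 1
                self.send_heartbeat(service)
        raise RuntimeError("admission service never became ready")
```

In this sketch, a coordinator that already holds two admitted queries and calls `admit_with_retry` against a fresh `ToyAdmissionService` is rejected once, sends its full state via the heartbeat path, and is admitted on the retry. The over-admission limitation is visible too: the toy service admits as soon as one coordinator has reported, without waiting for any others.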
