Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/23094 )
Change subject: IMPALA-12057: Track removed coordinators to reject queued queries early
......................................................................

IMPALA-12057: Track removed coordinators to reject queued queries early

Queries in global admission control can remain queued for a long time if they are assigned to a coordinator that has already left the cluster. Admissiond can't distinguish between a coordinator that hasn't yet been propagated via the statestore and one that has already been removed, so such queries wait unnecessarily until they time out. The timeout is determined by either FLAGS_queue_wait_timeout_ms or the queue_timeout_ms in the pool config. By default, FLAGS_queue_wait_timeout_ms is 1 minute, but in production it's normally configured to 10 to 15 minutes.

This change tracks recently removed coordinators and rejects such queued queries immediately using REASON_COORDINATOR_REMOVED. To keep the removed coordinator list simple and bounded, it avoids duplicate entries and evicts entries in FIFO order once the list reaches the smaller of MAX_REMOVED_COORD_SIZE (1000) and FLAGS_cluster_membership_retained_removed_coords.

It's possible that a coordinator marked as removed comes back with the same backend id. In that case, admissiond will see it in current_backends and won't need to check the removed list. Even if a coordinator briefly flaps and a request is rejected, it's not critical because the coordinator can retry. So, to keep the design simple and safe, we keep the removed coord entry as-is.

Added a parameter is_admissiond to the ClusterMembershipMgr constructor to indicate whether it is running within the admissiond.

Tests:
Passed exhaustive tests.
Added unit tests to verify the eviction logic and the duplicate case.
Added regression test test_coord_not_registered_in_ac.
Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0
Reviewed-on: http://gerrit.cloudera.org:8080/23094
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admissiond-env.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
M tests/custom_cluster/test_admission_controller.py
6 files changed, 307 insertions(+), 24 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/23094
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0
Gerrit-Change-Number: 23094
Gerrit-PatchSet: 17
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
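
The removed-coordinator tracking described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the change: `RemovedCoordinators`, `QueueDecision`, `Decide`, and the `std::string` backend id are hypothetical names. Only the cap of 1000, the FIFO eviction, the duplicate check, and the lookup order (current backends first, then the removed list) are taken from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the real backend id type.
using BackendId = std::string;

// Hard cap named in the commit message (MAX_REMOVED_COORD_SIZE).
constexpr std::size_t kMaxRemovedCoordSize = 1000;

// Bounded, duplicate-free FIFO of recently removed coordinators.
class RemovedCoordinators {
 public:
  // 'retained' plays the role of
  // FLAGS_cluster_membership_retained_removed_coords.
  explicit RemovedCoordinators(std::size_t retained)
    : capacity_(std::min(retained, kMaxRemovedCoordSize)) {}

  // Records a removal. Returns false if the id is already tracked (the
  // existing entry keeps its position) or if the capacity is zero.
  bool Add(const BackendId& id) {
    if (capacity_ == 0 || ids_.count(id) > 0) return false;
    if (order_.size() == capacity_) {
      // Evict the oldest entry to stay within the bound.
      ids_.erase(order_.front());
      order_.pop_front();
    }
    order_.push_back(id);
    ids_.insert(id);
    assert(order_.size() == ids_.size());
    return true;
  }

  bool Contains(const BackendId& id) const { return ids_.count(id) > 0; }
  std::size_t size() const { return order_.size(); }

 private:
  const std::size_t capacity_;
  std::deque<BackendId> order_;        // insertion order, oldest at the front
  std::unordered_set<BackendId> ids_;  // fast membership and duplicate check
};

enum class QueueDecision { kProceed, kRejectCoordinatorRemoved, kKeepWaiting };

// Decides what to do with a queued query assigned to 'coord'.
QueueDecision Decide(const std::unordered_set<BackendId>& current_backends,
    const RemovedCoordinators& removed, const BackendId& coord) {
  // A coordinator that is currently registered wins, even if an old
  // entry for it still sits in the removed list.
  if (current_backends.count(coord) > 0) return QueueDecision::kProceed;
  // Known to have left the cluster: reject right away instead of waiting.
  if (removed.Contains(coord)) return QueueDecision::kRejectCoordinatorRemoved;
  // Unknown: it may not have been propagated yet, so keep the query
  // queued until the normal timeout applies.
  return QueueDecision::kKeepWaiting;
}
```

Pairing a deque with a hash set keeps both eviction and lookup cheap in this sketch. Checking the current backends before the removed list is what lets a stale removed entry stay in place when a coordinator returns with the same backend id.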
