Yida Wu has uploaded a new patch set (#15). ( 
http://gerrit.cloudera.org:8080/23094 )

Change subject: IMPALA-12057: Track removed coordinators to reject queued 
queries early
......................................................................

IMPALA-12057: Track removed coordinators to reject queued queries early

Queries in global admission control can remain queued for a long time
if they are assigned to a coordinator that has already left the
cluster. Admissiond can't distinguish between a coordinator that
hasn't yet been propagated via the statestore and one that has
already been removed, so such queries wait unnecessarily until the
queue timeout expires. This timeout is determined by either
FLAGS_queue_wait_timeout_ms or the queue_timeout_ms in the pool
config. By default, FLAGS_queue_wait_timeout_ms is 1 minute, but in
production it's normally configured to 10 to 15 minutes.

This change tracks recently removed coordinators and immediately
rejects queued queries that belong to them, using
REASON_COORDINATOR_REMOVED. To keep the removed coordinator list
simple and bounded, it skips duplicate entries and evicts the oldest
entry first (FIFO) once the list exceeds the smaller of
MAX_REMOVED_COORD_SIZE (1000) and
FLAGS_cluster_membership_retained_removed_coords.
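
A minimal sketch of the bounded, duplicate-free FIFO list described
above. This is not the code in cluster-membership-mgr.cc. The class
name RemovedCoordTracker, its methods, and the use of std::string for
backend ids are assumptions made for illustration. The two
constructor arguments stand in for MAX_REMOVED_COORD_SIZE and
FLAGS_cluster_membership_retained_removed_coords.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>

// Hypothetical helper, not the actual ClusterMembershipMgr code.
class RemovedCoordTracker {
 public:
  // 'hard_cap' plays the role of MAX_REMOVED_COORD_SIZE and 'flag_value'
  // the role of FLAGS_cluster_membership_retained_removed_coords.
  RemovedCoordTracker(size_t hard_cap, size_t flag_value)
    : capacity_(std::min(hard_cap, flag_value)) {}

  // Records a removed coordinator. A duplicate id is ignored and keeps
  // its original position in the eviction order.
  void Add(const std::string& backend_id) {
    if (capacity_ == 0) return;
    if (!ids_.insert(backend_id).second) return;  // Already tracked.
    order_.push_back(backend_id);
    // FIFO eviction: drop the oldest entries until within capacity.
    while (order_.size() > capacity_) {
      ids_.erase(order_.front());
      order_.pop_front();
    }
  }

  bool Contains(const std::string& backend_id) const {
    return ids_.count(backend_id) > 0;
  }

  size_t size() const { return order_.size(); }

 private:
  const size_t capacity_;
  std::deque<std::string> order_;        // Insertion order for eviction.
  std::unordered_set<std::string> ids_;  // Fast duplicate/membership check.
};
```

Pairing a deque with a hash set keeps both the duplicate check and
the eviction cheap. Whether the real implementation uses the same
containers is not stated in this message.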

It's possible that a coordinator marked as removed comes back
with the same backend id. In that case, admissiond will see it in
current_backends and won't need to check the removed list. Even
if a coordinator briefly flaps and a request is rejected, it's not
critical because the coordinator can retry. So, to keep the design
simple and safe, we keep the removed coord entry as-is.
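
The lookup order described above can be sketched as follows. This is
an illustration under assumptions, not the admission-controller.cc
code: the enum CoordStatus, the function ClassifyCoordinator, and the
use of string sets for current_backends and the removed list are all
hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical outcome of looking up a queued query's coordinator.
enum class CoordStatus {
  ACTIVE,   // Present in current membership; proceed as usual.
  REMOVED,  // Known to have left; reject with REASON_COORDINATOR_REMOVED.
  UNKNOWN   // Not seen yet; keep the query queued until timeout.
};

// Checks current membership first, so a coordinator that came back with
// the same backend id is treated as active even if a stale entry for it
// is still in the removed list.
CoordStatus ClassifyCoordinator(const std::string& backend_id,
    const std::unordered_set<std::string>& current_backends,
    const std::unordered_set<std::string>& removed_coords) {
  if (current_backends.count(backend_id) > 0) return CoordStatus::ACTIVE;
  if (removed_coords.count(backend_id) > 0) return CoordStatus::REMOVED;
  return CoordStatus::UNKNOWN;
}
```

Because membership is consulted first, the removed entry never needs
to be cleaned up when a coordinator rejoins.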

Added a parameter is_admissiond to the ClusterMembershipMgr
constructor to indicate whether it is running within the admissiond.

Tests:
Passed exhaustive tests.
Added unit tests to verify the eviction logic and the duplicate
case.
Added regression test test_coord_not_registered_in_ac.

Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0
---
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admissiond-env.cc
M be/src/scheduling/cluster-membership-mgr-test.cc
M be/src/scheduling/cluster-membership-mgr.cc
M be/src/scheduling/cluster-membership-mgr.h
M tests/custom_cluster/test_admission_controller.py
6 files changed, 307 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/23094/15
--
To view, visit http://gerrit.cloudera.org:8080/23094
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0
Gerrit-Change-Number: 23094
Gerrit-PatchSet: 15
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
