[Impala-ASF-CR] IMPALA-12057: Track removed coordinators to reject queued queries early

Riza Suminto (Code Review) Wed, 09 Jul 2025 15:12:48 -0700

Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23094 )


Change subject: IMPALA-12057: Track removed coordinators to reject queued 
queries early
......................................................................


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/23094/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23094/5//COMMIT_MSG@13
PS5, Line 13: resulting in unnecessary waiting until timeout.
What is the default admission queue timeout? Please mention it in the commit 
message.

Does it make sense to retain removed_coordinators_map entries (assign a Time To 
Live) until this timeout passes rather than setting a fixed count of 1000? The 
TTL check and entry removal can be done whenever removed_coordinators_map is 
accessed.
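
A minimal sketch of the TTL idea, not the patch's actual code: 
RemovedCoordinators, MarkRemoved, WasRecentlyRemoved and EvictExpired are 
hypothetical names, the key is assumed to be a printable coordinator id, and 
the TTL is assumed to be set to the admission queue timeout. Expired entries 
are dropped lazily on each access, so no background cleanup is needed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for removed_coordinators_map: each entry remembers when
// the coordinator was removed and is dropped once it is at least 'ttl' old.
class RemovedCoordinators {
 public:
  using Clock = std::chrono::steady_clock;

  explicit RemovedCoordinators(Clock::duration ttl) : ttl_(ttl) {
    assert(ttl > Clock::duration::zero());
  }

  // Records that the coordinator with this id was removed at time 'now'.
  void MarkRemoved(const std::string& coord_id, Clock::time_point now) {
    EvictExpired(now);
    removed_at_[coord_id] = now;
  }

  // Returns true if the coordinator was removed less than one TTL ago.
  bool WasRecentlyRemoved(const std::string& coord_id, Clock::time_point now) {
    EvictExpired(now);
    return removed_at_.count(coord_id) > 0;
  }

  std::size_t size() const { return removed_at_.size(); }

 private:
  // Lazy cleanup, run on every access: erase entries whose age reached the TTL.
  void EvictExpired(Clock::time_point now) {
    for (auto it = removed_at_.begin(); it != removed_at_.end();) {
      if (now - it->second >= ttl_) {
        it = removed_at_.erase(it);
      } else {
        ++it;
      }
    }
  }

  Clock::duration ttl_;
  std::unordered_map<std::string, Clock::time_point> removed_at_;
};
```

Passing 'now' explicitly keeps the sketch easy to unit test. The full scan on 
each access is O(n); real code could keep entries ordered by removal time to 
stop at the first unexpired one.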


http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/admission-controller.cc@263
PS5, Line 263: The coordinator no longer exists.
Add the coordinator identifier or host/port to this message? That will make it 
easier to associate with the query profile.

FYI, the query profile has an entry like this in its Summary:

  Coordinator: vb1427.halxg.cloudera.com:27000
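
For illustration only, one way the message could carry the address. 
CoordinatorRemovedReason is a hypothetical helper, and the exact wording and 
how the host and port are obtained in admission-controller.cc are assumptions.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the rejection reason with the coordinator's
// host:port so it can be matched against the "Coordinator:" profile entry.
std::string CoordinatorRemovedReason(const std::string& host, int port) {
  assert(port > 0 && port <= 65535);
  return "The coordinator no longer exists: " + host + ":" +
      std::to_string(port);
}
```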


http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/admission-controller.cc@2547
PS5, Line 2547: FindGroupToAdmitOrReject
The call path to detect Coordinator removal is from FindGroupToAdmitOrReject -> 
ComputeGroupScheduleStates.
So why not make FindGroupToAdmitOrReject return False, or 
ComputeGroupScheduleStates return a non-OK status, upon finding that the 
Coordinator was removed?

is_rejected should be True here if Coordinator is in removed_coordinators_map, 
right?


http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/cluster-membership-mgr-test.cc
File be/src/scheduling/cluster-membership-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/cluster-membership-mgr-test.cc@329
PS5, Line 329: for (int i = 0; i < 2; ++i) {
Can you increase the loop count? Maybe to 10?


http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/cluster-membership-mgr-test.cc@332
PS5, Line 332: PrintId(be.backend_id())
Can you create another test like this but with a different backend_id in each 
iteration?


http://gerrit.cloudera.org:8080/#/c/23094/5/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/23094/5/tests/custom_cluster/test_admission_controller.py@2170
PS5, Line 2170: test_coord_not_registered_in_ac(self)
Can you loop the test body, say, 3x to ensure consistent behavior across 
coordinator restarts (or intermittent network partitions)?


http://gerrit.cloudera.org:8080/#/c/23094/5/tests/custom_cluster/test_admission_controller.py@2178
PS5, Line 2178: self.execute_query_async(query)
Use self.client.execute_async so it is consistent with L2185.



--
To view, visit http://gerrit.cloudera.org:8080/23094
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0
Gerrit-Change-Number: 23094
Gerrit-PatchSet: 5
Gerrit-Owner: Yida Wu <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Wed, 09 Jul 2025 22:12:33 +0000
Gerrit-HasComments: Yes
