Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/23094 )
Change subject: IMPALA-12057: Track removed coordinators to reject queued queries early ...................................................................... Patch Set 5: (7 comments) http://gerrit.cloudera.org:8080/#/c/23094/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23094/5//COMMIT_MSG@13 PS5, Line 13: resulting in unnecessary waiting until timeout. What is the default admission queue timeout? Please mention in commit message. Does it make sense to retain removed_coordinators_map entries (assign Time To Live) until this timeout pass rather than set a fixed 1000 count? The TTL check and entry removal can be done when removed_coordinators_map is accessed. http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/admission-controller.cc@263 PS5, Line 263: The coordinator no longer exists. Add coordinator identifier or host/port in this message? It will make association easier with query profile. FYI, query profile has entry like this in Summary Coordinator: vb1427.halxg.cloudera.com:27000 http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/admission-controller.cc@2547 PS5, Line 2547: FindGroupToAdmitOrReject The call path to detect Coordinator removal is from FindGroupToAdmitOrReject -> ComputeGroupScheduleStates. So why not make FindGroupToAdmitOrReject to return False or ComputeGroupScheduleStates to return non-OK status upon finding Coordinator removal? is_rejected should be True here if Coordinator is in removed_coordinators_map, right? http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/cluster-membership-mgr-test.cc File be/src/scheduling/cluster-membership-mgr-test.cc: http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/cluster-membership-mgr-test.cc@329 PS5, Line 329: for (int i = 0; i < 2; ++i) { Can you increase the loop? Maybe 10? http://gerrit.cloudera.org:8080/#/c/23094/5/be/src/scheduling/cluster-membership-mgr-test.cc@332 PS5, Line 332: PrintId(be.backend_id()) Can you create another test like this but with changing backend_id in each loop? http://gerrit.cloudera.org:8080/#/c/23094/5/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/23094/5/tests/custom_cluster/test_admission_controller.py@2170 PS5, Line 2170: test_coord_not_registered_in_ac(self) Can you loop the test body, say, 3x to ensure consistent behavior between coordinator restart? (or intermittent network partition). http://gerrit.cloudera.org:8080/#/c/23094/5/tests/custom_cluster/test_admission_controller.py@2178 PS5, Line 2178: self.execute_query_async(query) Use self.client.execute_async so it consistent with L2185. -- To view, visit http://gerrit.cloudera.org:8080/23094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0 Gerrit-Change-Number: 23094 Gerrit-PatchSet: 5 Gerrit-Owner: Yida Wu <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Wed, 09 Jul 2025 22:12:33 +0000 Gerrit-HasComments: Yes
