Jason Fehr has posted comments on this change. ( http://gerrit.cloudera.org:8080/23204 )
Change subject: IMPALA-14234: Fix a version mismatch DCHECK hit when admissiond cluster membership recovering ...................................................................... Patch Set 3: Code-Review+1 (8 comments) http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc File be/src/scheduling/cluster-membership-mgr.cc: http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc@305 PS3, Line 305: } I'm not quite understanding why the if statement is necessary. Wouldn't it be better to always set new_state->version = current_membership_->version in a recovery scenario? That will happen anyways after recovery has completed. http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2238 PS3, Line 2238: This sequence can trigger a DCHECK hit in AdmissionD. Any idea what percentage of the time the DCHECK fails? If we know that number, then we should loop this test that many times to be more confident the issue has been fixed. For example, if the DCHECK fails 10% of the time, then loop this test 10 times. http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2243 PS3, Line 2243: select count(*) from functional.alltypes limit 1 A coordinator-only query such as "select 1" would be better to eliminate any issues with executor communication. http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2254 PS3, Line 2254: create_beeswax_client Nit: Beeswax is deprecated, use HS2 if possible. Also in line 2259. http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2271 PS3, Line 2271: sleep(2) Use statestore.wait_for_exit() instead of sleep http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2274 PS3, Line 2274: sleep(3) Use coord1.wait_for_exit() instead of sleep http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2278 PS3, Line 2278: handle3 = client2.execute_async(short_query) Does this query need a slight delay to have a better chance of failing the DCHECK? Possibly the "AC_BEFORE_ADMISSION" debug action could be used? http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2305 PS3, Line 2305: coord1.start() Why the need to restart coord1? This test class appears to be restarting the minicluster after each test, thus the next test run will take care of restarting the entire minicluster. -- To view, visit http://gerrit.cloudera.org:8080/23204 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iea77347bd4775abd7866817146e326c7c5042f5e Gerrit-Change-Number: 23204 Gerrit-PatchSet: 3 Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com> Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Jason Fehr <jf...@cloudera.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com> Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com> Gerrit-Comment-Date: Thu, 31 Jul 2025 23:18:04 +0000 Gerrit-HasComments: Yes