Jason Fehr has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23204 )

Change subject: IMPALA-14234: Fix a version mismatch DCHECK hit when admissiond 
cluster membership recovering
......................................................................


Patch Set 3: Code-Review+1

(8 comments)

http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/23204/3/be/src/scheduling/cluster-membership-mgr.cc@305
PS3, Line 305:       }
I'm not quite understanding why the if statement is necessary.  Wouldn't it be 
better to always set new_state->version = current_membership_->version in a 
recovery scenario?  That will happen anyways after recovery has completed.


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2238
PS3, Line 2238:     This sequence can trigger a DCHECK hit in AdmissionD.
Any idea what percentage of the time the DCHECK fails?  If we know that number, 
then we should loop this test that many times to be more confident the issue 
has been fixed.

For example, if the DCHECK fails 10% of the time, then loop this test 10 times.


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2243
PS3, Line 2243: select count(*) from functional.alltypes limit 1
A coordinator-only query such as "select 1" would be better to eliminate any 
issues with executor communication.


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2254
PS3, Line 2254: create_beeswax_client
Nit:  Beeswax is deprecated, use HS2 if possible.  Also in line 2259.


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2271
PS3, Line 2271:     sleep(2)
Use statestore.wait_for_exit() instead of sleep


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2274
PS3, Line 2274:     sleep(3)
Use coord1.wait_for_exit() instead of sleep


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2278
PS3, Line 2278:     handle3 = client2.execute_async(short_query)
Does this query need a slight delay to have a better chance of failing the 
DCHECK?  Possibly the "AC_BEFORE_ADMISSION" debug action could be used?


http://gerrit.cloudera.org:8080/#/c/23204/3/tests/custom_cluster/test_admission_controller.py@2305
PS3, Line 2305:     coord1.start()
Why the need to restart coord1?  This test class appears to be restarting the 
minicluster after each test, thus the next test run will take care of 
restarting the entire minicluster.



--
To view, visit http://gerrit.cloudera.org:8080/23204
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iea77347bd4775abd7866817146e326c7c5042f5e
Gerrit-Change-Number: 23204
Gerrit-PatchSet: 3
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Jason Fehr <jf...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Comment-Date: Thu, 31 Jul 2025 23:18:04 +0000
Gerrit-HasComments: Yes

Reply via email to