Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19608 )
Change subject: [master] Exclude tservers in MAINTENANCE_MODE when leader rebalancing
......................................................................


Patch Set 9:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/19608/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19608/9//COMMIT_MSG@14
PS9, Line 14: For this reason, we should exclude
            : such tservers.
Having this patch is great, but it seems there is a race condition in this approach. Shouldn't tablet replicas hosted on tablet servers in maintenance mode just refuse to become leaders?


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc
File src/kudu/master/auto_leader_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@234
PS9, Line 234:   string exe_file;
             :   CHECK_OK(Env::Default()->GetExecutablePath(&exe_file));
             :   const string kudu_cli_path = DirName(exe_file);
             :
             :   // Make 1 tserver enter MAINTENANCE_MODE.
             :   ASSERT_OK(Subprocess::Call(Substitute("$0/kudu tserver state enter_maintenance $1 $2",
             :                                         kudu_cli_path,
             :                                         master_addresses,
             :                                         mini_tserver->uuid())));
> Is it possible to call TSManager::SetTServerState directly?
+1


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@251
PS9, Line 251:   mini_tserver->Restart();
Wrap this into ASSERT_OK()? Also, it would be great to add a comment explaining why this tablet server is restarted.


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@253
PS9, Line 253:   SleepFor(MonoDelta::FromSeconds(10 * FLAGS_auto_rebalancing_interval_seconds));
nit: please add a comment on the purpose of this pause


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@254
PS9, Line 254: 20
Why 20? Why not 10 or 100?


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@264
PS9, Line 264: CheckLeaderBalance().ok()
Does it make sense to check for the exact Status code? And the error message pattern?


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.h
File src/kudu/master/auto_leader_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.h@79
PS9, Line 79: std::set<std::string>
Is it crucial for the set of UUIDs to be ordered? If not, consider using std::unordered_set, since lookups are faster when the set is large.


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.cc
File src/kudu/master/auto_leader_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.cc@405
PS9, Line 405: RunLeaderRebalanceForTable(table_info, tserver_uuids, exclude_dest_uuids);
What if one more tablet server is put into maintenance mode after the list is built above in lines 391-397 but before this RunLeaderRebalanceForTable() is called?


--
To view, visit http://gerrit.cloudera.org:8080/19608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2f85a675e69fd02a62e2625881dad2ca5e27acd9
Gerrit-Change-Number: 19608
Gerrit-PatchSet: 9
Gerrit-Owner: Yuqi Du <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: KeDeng <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Yuqi Du <[email protected]>
Gerrit-Comment-Date: Tue, 28 Mar 2023 02:31:49 +0000
Gerrit-HasComments: Yes
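Illustration for the comment on auto_leader_rebalancer.h line 79: a minimal sketch of what the std::unordered_set suggestion could look like, assuming the ordering of excluded UUIDs is not needed anywhere. FilterDestinations is a hypothetical helper written for this note, not a function in the Kudu code base; only the parameter names tserver_uuids and exclude_dest_uuids are borrowed from the call quoted in the review.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of Kudu): returns the tserver UUIDs that
// remain eligible as leader-transfer destinations once the excluded ones
// (e.g. tservers in maintenance mode) are dropped. Input order is preserved.
std::vector<std::string> FilterDestinations(
    const std::vector<std::string>& tserver_uuids,
    const std::unordered_set<std::string>& exclude_dest_uuids) {
  std::vector<std::string> eligible;
  eligible.reserve(tserver_uuids.size());
  for (const auto& uuid : tserver_uuids) {
    // Hash lookup: O(1) on average, versus O(log n) for std::set.
    if (exclude_dest_uuids.find(uuid) == exclude_dest_uuids.end()) {
      eligible.push_back(uuid);
    }
  }
  return eligible;
}
```

Whether the switch is worthwhile depends on how many tablet servers a cluster has; for a handful of UUIDs the difference is likely negligible, and std::set remains the right choice if any caller relies on sorted iteration.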
