Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19608 )
Change subject: [master] Exclude tservers in MAINTENANCE_MODE when leader rebalancing ...................................................................... Patch Set 12: (10 comments) http://gerrit.cloudera.org:8080/#/c/19608/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19608/9//COMMIT_MSG@14 PS9, Line 14: For this reason, we should exclude : such tservers. > Yes. Right -- maintenance mode is just for masters to avoid placing replicas to corresponding tablet servers. I guess if the idea is just to avoid transferring leadership to replicas hosted at tablet servers in maintenance node, this patch should be good enough then. Thanks! http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc File src/kudu/master/auto_leader_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@214 PS12, Line 214: const int kNumTServers = 3; Since this test scenario involves many iterations and pauses, consider adding SKIP_IF_SLOW_NOT_ALLOWED() in the very beginning of this test? http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@215 PS12, Line 215: kNumTablets It is possible to provide the essence of the this test scenario with less number of tablets in the test cluster? I guess the number of iterations/tries to run leader rebalancing depends on this? If yes, then maybe use a smaller number for faster run-time and less computing resources spent during test cycles? http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@244 PS12, Line 244: 20 tries 'leader rebalance' 20 'leader rebalance' iterations http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@244 PS12, Line 244: not is not http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@245 PS12, Line 245: // it's enough to reach leader balanced, more tries is not necessary and less tries : // may not reach leader rebalanced. Does the number somehow depends on the number of tables in the cluster? If so, maybe make that constant to be derived from kNumTablets above? http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@249 PS12, Line 249: CHECK_OK Why CHECK_OK() here instead of ASSERT_OK()? Why it's so important to crash with due to SIGABRT instead of just failing the test if something goes wrong? http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@255 PS12, Line 255: a rebalanced state of leaders the state of balanced leadership distribution http://gerrit.cloudera.org:8080/#/c/19608/12/src/kudu/master/auto_leader_rebalancer-test.cc@261 PS12, Line 261: be have http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.cc File src/kudu/master/auto_leader_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.cc@405 PS9, Line 405: CatalogManager::ScopedLeaderSharedLock leader_lock(catalog_manager_); > Thanks for your advice. The problem exist indeed, unless we use a big lock Right -- having a provisional check to exclude tablet servers in maintenance mode should be good enough. Anyways, even if leader replicas land in a tablet server in the middle when it's being put in the maintenance mode, it should not be a big deal -- Raft takes care of re-elections once the tablet server is down, and existing Kudu client libraries automatically retry write operations on that case. I guess all tablet replicas at quesceing tablet servers simply rejects requests to transfer leadership to them, and it makes sense to see how the leader rebalancing works in that case -- maybe, some updates are needed. Thanks! -- To view, visit http://gerrit.cloudera.org:8080/19608 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2f85a675e69fd02a62e2625881dad2ca5e27acd9 Gerrit-Change-Number: 19608 Gerrit-PatchSet: 12 Gerrit-Owner: Yuqi Du <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: KeDeng <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Wang Xixu <[email protected]> Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Reviewer: Yuqi Du <[email protected]> Gerrit-Comment-Date: Tue, 04 Apr 2023 21:00:35 +0000 Gerrit-HasComments: Yes
