Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/18454 )
Change subject: [master] KUDU-3390 support auto rebalance tablet leaders across TServers ...................................................................... Patch Set 32: (12 comments) Thanks for your crs. http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG@23 PS29, Line 23: detail infomations > detailed information Done http://gerrit.cloudera.org:8080/#/c/18454/30//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18454/30//COMMIT_MSG@20 PS30, Line 20: the queue, pipeline, batc > Maybe you need to explain what leads to an unbalanced load, I really don't The reason I called it hidden variables is that I cann't expain it very clearly also. The imbalance's result based my experience. Some facts can expain partially: leader's write request conresponding to follower's UpdateConsensus request, the two process is diffenent, such as theads and queue, pipeline and batch size . My words may cause confuse, I will fix the words. http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@7 PS32, Line 7: auto rebalance tablet leaders > I'm curious why this has started with that background task dong the rebalan users can use kudu CLI leader_step_down command and write a script program to rebalance the leaders. SREs should make the rebalance script run periodically. Our company have more than 1500+ kudu clusters and more and more kudu clusters will be deployed, so it's hard that SREs maintenance the rebalance script tasks. The better way is kudu leader rebalance itself. And kudu has the 'auto-rebalancer' and has no 'auto-leader-rebalancer', so add 'auto-leader-rebalancer' is naturally. 1) Yes, I plan adding a new CLI tool to replace the old one, because it may conflict to the new 'auto-rebalancer', 'auto-rebalancer'. That need another discussion. The purpose of the CLI is for some special situation, eg when user turn off 'auto-leader-rebalancer' and need run it once. Because I did SRE' work before, I understand they have kinds of requirements. 2) auto-rebalancing task would do replicas rebalanced, then leader rebalancing task can reach the best effect. I think 'auto-rebalancing' and 'auto-lead-rebalancing' both should turn on, but it's no matter who first turn on. http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10 PS32, Line 10: adding > adds Done http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10 PS32, Line 10: and > Drop 'and' Done http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@20 PS32, Line 20: may cause imbalanced load > It would be nice to quantify this statement, adding measurement similar to This is qualitative analysis, of course quantitative analysis is more better. But I have no direct evidence for this, no data for this. I can not tell you the proportion of write skew in imbalanced load. I have found several reviewers referred to this, what do you suggest about this? In order to avoid confused and not clearly statements, I should remove it, only keep the main reason? http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23 PS32, Line 23: detail infomations > details Done http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23 PS32, Line 23: ( > nit: add a space before this opening parenthesis Done http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/auto_leader_rebalancer.cc File src/kudu/master/auto_leader_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/auto_leader_rebalancer.cc@287 PS30, Line 287: moves_scheduled_this_round_for_test_ = leader_transfer_tasks.size(); : VLOG(1) << Substitute("leader rebalance tasks, size > Besides, if the leadership of many tablets is changed, clients will send mo OK Add a flag for this. http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/auto_leader_rebalancer.cc@350 PS30, Line 350: { > If there are some temporarily unavailable tservers in a cluster, is it a go Simply, skip it and next leader rebalance can makeit rebalanced also. Because the tserver is unsteady just now, delayed leader rebalance is ok. If a kudu cluster is extreme unsteady, the problem is not the leader rebalance to solve. http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/catalog_manager.cc@352 PS30, Line 352: auto_leader_rebalancing_enabled > This flag should be tagged with 'runtime' too. OK, done http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/catalog_manager.cc@1061 PS30, Line 1061: : // Leader rebalancer depend on a good replicas balance, that means we'd better enable : // auto_rebalancing. If auto_rebalancing is disabled and leader rebalance is enabled, : > I think it is worth pointing out the difference between rebalancing tablet If auto_rebalancing is disabled and leader rebalance is enabled, the algorithm can also work, becase our algorithm is keeping a propotion of leaders/follows (1 : replication_refactor - 1) for every tserver' every table. It's no matter to whether replicas balanced. -- To view, visit http://gerrit.cloudera.org:8080/18454 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibfb60d8759a93b6a19238637c27df4f6b1cac918 Gerrit-Change-Number: 18454 Gerrit-PatchSet: 32 Gerrit-Owner: Yuqi Du <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Reviewer: Yuqi Du <[email protected]> Gerrit-Comment-Date: Thu, 06 Oct 2022 04:03:09 +0000 Gerrit-HasComments: Yes
