[kudu-CR] [master] KUDU-3390 support auto rebalance tablet leaders across TServers

Yuqi Du (Code Review) Wed, 05 Oct 2022 21:07:23 -0700

Yuqi Du has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18454 )


Change subject: [master] KUDU-3390 support auto rebalance tablet leaders across 
TServers
......................................................................


Patch Set 32:

(12 comments)

Thanks for your code reviews.

http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG@23
PS29, Line 23: detail infomations
> detailed information
Done


http://gerrit.cloudera.org:8080/#/c/18454/30//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18454/30//COMMIT_MSG@20
PS30, Line 20: the queue, pipeline, batc
> Maybe you need to explain what leads to an unbalanced load, I really don't
I called them hidden variables because I can't explain them very clearly 
either.
The imbalance is something I observed in practice. Some facts explain it 
partially:

A leader's write request corresponds to a follower's UpdateConsensus request, 
and the two are processed differently, e.g. in threads, queues, pipelines, and 
batch size.

My wording may cause confusion; I will fix it.


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@7
PS32, Line 7: auto rebalance tablet leaders
> I'm curious why this has started with that background task dong the rebalan
Users can use the kudu CLI leader_step_down command and write a script to 
rebalance the leaders. SREs would then have to run the rebalance script 
periodically.

Our company has more than 1500 Kudu clusters, and more and more will be 
deployed, so it's hard for SREs to maintain the rebalance script tasks. The 
better way is for Kudu to rebalance leaders itself.

Also, Kudu has the 'auto-rebalancer' but no 'auto-leader-rebalancer', so adding 
an 'auto-leader-rebalancer' is natural.


1) Yes, I plan to add a new CLI tool to replace the old one, because the old 
one may conflict with the 'auto-rebalancer' and the new 
'auto-leader-rebalancer'. That needs a separate discussion. The CLI is meant 
for special situations, e.g. when a user has turned off 
'auto-leader-rebalancer' and needs to run it once. Because I did SRE work 
before, I understand they have all kinds of requirements.
2) The auto-rebalancing task balances the replicas, and then the leader 
rebalancing task can reach its best effect.
I think both 'auto-rebalancing' and 'auto-leader-rebalancing' should be turned 
on, but it doesn't matter which one is turned on first.


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10
PS32, Line 10: adding
> adds
Done


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10
PS32, Line 10: and
> Drop 'and'
Done


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@20
PS32, Line 20: may cause imbalanced load
> It would be nice to quantify this statement, adding measurement similar to
This is a qualitative analysis; of course a quantitative analysis would be 
better. But I have no direct evidence or data for this. I cannot tell you what 
proportion of the imbalanced load is caused by write skew.

Several reviewers have raised this point; what do you suggest?
To avoid confusing and unclear statements, should I remove it and keep only 
the main reason?


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23
PS32, Line 23: detail infomations
> details
Done


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23
PS32, Line 23: (
> nit: add a space before this opening parenthesis
Done


http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/auto_leader_rebalancer.cc
File src/kudu/master/auto_leader_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/auto_leader_rebalancer.cc@287
PS30, Line 287: moves_scheduled_this_round_for_test_ = 
leader_transfer_tasks.size();
              :   VLOG(1) << Substitute("leader rebalance tasks, size
> Besides, if the leadership of many tablets is changed, clients will send mo
OK
Add a flag for this.


http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/auto_leader_rebalancer.cc@350
PS30, Line 350:   {
> If there are some temporarily unavailable tservers in a cluster, is it a go
Simply skip it; the next leader rebalance round can rebalance it.
Because the tserver is unstable at that moment, a delayed leader rebalance is 
OK.
If a Kudu cluster is extremely unstable, that is not a problem for the leader 
rebalancer to solve.


http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/catalog_manager.cc@352
PS30, Line 352: auto_leader_rebalancing_enabled
> This flag should be tagged with 'runtime' too.
OK, done


http://gerrit.cloudera.org:8080/#/c/18454/30/src/kudu/master/catalog_manager.cc@1061
PS30, Line 1061:
               :   // Leader rebalancer depend on a good replicas balance, that 
means we'd better enable
               :   // auto_rebalancing. If auto_rebalancing is disabled and 
leader rebalance is enabled,
               :
> I think it is worth pointing out the difference between rebalancing tablet
If auto_rebalancing is disabled and leader rebalancing is enabled, the 
algorithm still works, because it keeps a proportion of leaders to followers 
(1 : replication_factor - 1) for every table on every tserver.
It does not matter whether the replicas are balanced.
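
A minimal sketch of that proportion may help here. The names ReplicaCounts, 
TargetLeaders and LeaderSurplus are hypothetical and are not taken from the 
patch, and rounding the target up is an assumption. The target depends only on 
how many replicas of the table a tserver hosts, which is why it can be computed 
even when replicas are unevenly spread:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-tserver, per-table counts; not the types used in the patch.
struct ReplicaCounts {
  std::int64_t replicas;  // all replicas of the table hosted on this tserver
  std::int64_t leaders;   // how many of those replicas are currently leaders
};

// Target leader count that keeps leaders : followers at 1 : (rf - 1),
// i.e. leaders = replicas / rf. Rounding up is an assumption of this sketch.
inline std::int64_t TargetLeaders(std::int64_t replicas,
                                  int replication_factor) {
  assert(replication_factor > 0);
  return (replicas + replication_factor - 1) / replication_factor;
}

// Positive: the tserver holds more leaders than its share and is a candidate
// source for leader transfers. Negative: it could accept more leaders.
inline std::int64_t LeaderSurplus(const ReplicaCounts& counts,
                                  int replication_factor) {
  return counts.leaders - TargetLeaders(counts.replicas, replication_factor);
}
```

For example, with a replication factor of 3, a tserver hosting 9 replicas of a 
table has a target of 3 leaders regardless of how many replicas other tservers 
host, so if it currently holds 5 leaders its surplus is 2.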



--
To view, visit http://gerrit.cloudera.org:8080/18454

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibfb60d8759a93b6a19238637c27df4f6b1cac918
Gerrit-Change-Number: 18454
Gerrit-PatchSet: 32
Gerrit-Owner: Yuqi Du <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Yuqi Du <[email protected]>
Gerrit-Comment-Date: Thu, 06 Oct 2022 04:03:09 +0000
Gerrit-HasComments: Yes
