Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/18454 )
Change subject: [master] KUDU-3061 support auto rebalance tablet leaders across TServers ...................................................................... Patch Set 19: (10 comments) Thanks your crs. Very Kind. > Patch Set 19: > > (1 comment) > > I didn't really look into the codes because I'd like to understand the design > consideration and the use of algorithm first. > Here are some questions I don't know much about: > - Is this leader rebalance task only transfer the leadership between replicas > of the same tablet? Or it also copy replicas from one server to another? Only do leader transfer, do not copy replicas. > - Do we have to rebalance replicas in a cluster before we run this leader > rebalance task? Yes, it's better enable rebalance replicas. If enable leader rebalancer but disable auto rebalancer the algorithm work well but the effect is not good. The algorithm can be convergence, and the algorithm's target is every tserver' replicas, number of leader : number of follower is 1 : (replica_refactor -1). > - Can we finally get a fully balanced cluster in all cases? Yes, we should enable auto rebalancer and rebalancer leader rebalancer together. > > If you already have a design document about this issue, can you share it so > wen can discuss about the details. I'm sorry, I only write an informal document in chinese. The basic and core idea can be described as: For a specific table, its replicas assigned to tservers(by create table or rebalance task), at a specific tserver should satisfy number of leaders : number of followers is 1 : (replica_refactor -1). every table should do leader rebalance. http://gerrit.cloudera.org:8080/#/c/18454/19//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18454/19//COMMIT_MSG@14 PS19, Line 14: Two reasons of load skew: > Besides clarify why we need to balance leader replicas across TServers, it ok, I' ll do this at unit test. http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/client/flex_partitioning_client-test.cc File src/kudu/client/flex_partitioning_client-test.cc: http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/client/flex_partitioning_client-test.cc@341 PS19, Line 341: // Using one-level hash bucketing { 3, "key" } as the custom hash schema > Sorry, I didn't get the point. Is this related to this patch? According to crs before, GetTableInfo() is changed. catalog_manager.h#771. http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_leader_rebalancer.cc File src/kudu/master/auto_leader_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_leader_rebalancer.cc@68 PS19, Line 68: 10, > nit: would it be more convenient if the gflag name and default value are in Done http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_leader_rebalancer.cc@72 PS19, Line 72: 3600, > same Done http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_leader_rebalancer.cc@96 PS19, Line 96: "catalog-manager" > nit: better to use the same category name, i.e. "catalog manager", as other Done http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_leader_rebalancer.cc@103 PS19, Line 103: if (thread_) { > Is there any case thread_ is not initialized? Done http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_leader_rebalancer.cc@181 PS19, Line 181: const vector<string>& leaders = FindWithDefault(uuid_leaders_map, uuid, {}); : int32_t leader_count = leaders.size(); > nit: move these code closer to the place where it used, means after L188. Done http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_leader_rebalancer.cc@192 PS19, Line 192: int32_t should_transfer_count = leader_count - (replica_count / table_data.num_replicas() + 1); > Could you please add some comments to describe the algorithm? Done http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_rebalancer.cc@a181 PS19, Line 181: > why remove it? Done http://gerrit.cloudera.org:8080/#/c/18454/19/src/kudu/master/auto_rebalancer.cc@177 PS19, Line 177: [this]() { this->RunLoop(); }, &thread_); : > nit: alignment and remove blank line. Done -- To view, visit http://gerrit.cloudera.org:8080/18454 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibfb60d8759a93b6a19238637c27df4f6b1cac918 Gerrit-Change-Number: 18454 Gerrit-PatchSet: 19 Gerrit-Owner: Yuqi Du <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Reviewer: Yuqi Du <[email protected]> Gerrit-Comment-Date: Wed, 06 Jul 2022 07:19:37 +0000 Gerrit-HasComments: Yes
