Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/18454 )
Change subject: [master] KUDU-3390 support auto rebalance tablet leaders across TServers ...................................................................... Patch Set 32: (35 comments) Thank you for the patch! I took a quick first look. http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG@23 PS29, Line 23: detail infomations detailed information http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@7 PS32, Line 7: auto rebalance tablet leaders I'm curious why this has started with that background task dong the rebalancing. Historically, the current replica rebalancing has appeared first as a CLI tool since it was easier to see it in action, and only after some time that functionality appeared as a background task run by the catalog manager. So, a couple of questions here just out of curiousity: 1) Do you plan to add a kudu CLI tool that performs leader rebalancing? 2) If the answer for the first question is 'yes', why did you prefer to start with the auto-rebalancing task first? Thanks! http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10 PS32, Line 10: and Drop 'and' http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10 PS32, Line 10: adding adds http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@20 PS32, Line 20: may cause imbalanced load It would be nice to quantify this statement, adding measurement similar to those you posted for the scan case. http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23 PS32, Line 23: ( nit: add a space before this opening parenthesis http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23 PS32, Line 23: detail infomations details http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc File src/kudu/master/auto_leader_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@60 PS32, Line 60: consensus_inject_latency_ms_in_notifications Is this needed here? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@62 PS32, Line 62: raft_heartbeat_interval_ms And the same for several other declarations of gflags in this section: are these needed? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@70 PS32, Line 70: METRIC_DECLARE_gauge_int32(tablet_copy_open_client_sessions); : METRIC_DECLARE_counter(tablet_copy_bytes_fetched); : METRIC_DECLARE_counter(tablet_copy_bytes_sent); Are these needed in this file? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@139 PS32, Line 139: and leader balanced How do we know that the distribution of the leader replicas is balanced? I didn't see anything in the scenario below that guarantees that, but maybe I'm missing something. http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@150 PS32, Line 150: CreateWorkloadTable(kNumTablets, /*num_replicas*/ 3); How do we know at this point that the cluster is leader-balanced? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@168 PS32, Line 168: SleepFor(MonoDelta::FromSeconds(20)); That's a really long sleep time: that's not a good idea to have something like this in the test because it increases the run-time of the whole test suite. Is it possible to use some custom setting for the parameters of the auto-rebalancing task to shorten this? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@169 PS32, Line 169: 50 I'm curious why is the number of retries so high? Is it possible to make the rebalancer making more progress each iteration instead just in the sake of shortening run-time of this test scenario? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@175 PS32, Line 175: SleepFor(MonoDelta::FromSeconds(2)); Why is it necessary to have a pause here? Would be great to add a comment to clarify. http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h File src/kudu/master/auto_leader_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h@45 PS32, Line 45: // A CatalogManager background task which auto-rebalances tablets' leaders distribution. nit: remove this period http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h@46 PS32, Line 46: by the way leader transfer by transferring leadership between tablet replicas http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h@95 PS32, Line 95: std::shared_ptr<rpc::Messenger> messenger_; nit: please document this member as well -- what is it for? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc File src/kudu/master/auto_leader_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@55 PS32, Line 55: using std::map; : using std::nullopt; : using std::string; : using std::vector; : : using kudu::consensus::ConsensusServiceProxy; : using kudu::consensus::LeaderStepDownMode; : using kudu::consensus::LeaderStepDownRequestPB; : using kudu::consensus::LeaderStepDownResponsePB; : using kudu::consensus::RaftPeerPB; : : using kudu::rpc::MessengerBuilder; : using kudu::rpc::RpcController; : using strings::Substitute; nit: remove empty lines and sort the rest of non-empty lines alphabetically http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@293 PS32, Line 293: vary very? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@293 PS32, Line 293: synchronized rpc What's "synchronized rpc"? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@299 PS32, Line 299: LOG(INFO) I don't think this sort of information is worth it having in the INFO log: maybe, change this into VLOG(1) instead? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@304 PS32, Line 304: string nit: could this be 'const string&' or 'const auto&' to avoid copying? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@330 PS32, Line 330: strings:: Drop the namespace prefix since there is corresponding 'using' using strings::Substitute' in the beginning of this file. http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@337 PS32, Line 337: VLOG(0) What's VLOG(0), and why not to use VLOG(1) for this? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@353 PS32, Line 353: auto-leader-rebalancing automatic leader rebalancing http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1062 PS32, Line 1062: depend depends http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1062 PS32, Line 1062: good replicas balance Do you mean it relies on the even distribution of tablet replicas across all existing tablet servers, so every tablet server contains almost the same number of tablet replicas as any other tablet server? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1063 PS32, Line 1063: rebalance rebalancing http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1063 PS32, Line 1063: auto_rebalancing auto-rebalancing http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1064 PS32, Line 1064: load rebalance What is 'load rebalance'? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1064 PS32, Line 1064: cann't cannot http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1067 PS32, Line 1067: becaunse because http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1067 PS32, Line 1067: follows followers? http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1067 PS32, Line 1067: propotion ? -- To view, visit http://gerrit.cloudera.org:8080/18454 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibfb60d8759a93b6a19238637c27df4f6b1cac918 Gerrit-Change-Number: 18454 Gerrit-PatchSet: 32 Gerrit-Owner: Yuqi Du <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Reviewer: Yuqi Du <[email protected]> Gerrit-Comment-Date: Sat, 01 Oct 2022 01:58:09 +0000 Gerrit-HasComments: Yes
