Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18454 )

Change subject: [master] KUDU-3390 support auto rebalance tablet leaders across 
TServers
......................................................................


Patch Set 32:

(35 comments)

Thank you for the patch!

I took a quick first look.

http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18454/29//COMMIT_MSG@23
PS29, Line 23: detail infomations
detailed information


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@7
PS32, Line 7: auto rebalance tablet leaders
I'm curious why this has started with that background task dong the 
rebalancing.  Historically, the current replica rebalancing has appeared first 
as a CLI tool since it was easier to see it in action, and only after some time 
that functionality appeared as a background task run by the catalog manager.

So, a couple of questions here just out of curiousity:

1) Do you plan to add a kudu CLI tool that performs leader rebalancing?
2) If the answer for the first question is 'yes', why did you prefer to start 
with the auto-rebalancing task first?

Thanks!


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10
PS32, Line 10: and
Drop 'and'


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@10
PS32, Line 10: adding
adds


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@20
PS32, Line 20: may cause imbalanced load
It would be nice to quantify this statement, adding measurement similar to 
those you posted for the scan case.


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23
PS32, Line 23: (
nit: add a space before this opening parenthesis


http://gerrit.cloudera.org:8080/#/c/18454/32//COMMIT_MSG@23
PS32, Line 23: detail infomations
details


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc
File src/kudu/master/auto_leader_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@60
PS32, Line 60: consensus_inject_latency_ms_in_notifications
Is this needed here?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@62
PS32, Line 62: raft_heartbeat_interval_ms
And the same for several other declarations of gflags in this section: are 
these needed?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@70
PS32, Line 70: METRIC_DECLARE_gauge_int32(tablet_copy_open_client_sessions);
             : METRIC_DECLARE_counter(tablet_copy_bytes_fetched);
             : METRIC_DECLARE_counter(tablet_copy_bytes_sent);
Are these needed in this file?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@139
PS32, Line 139: and leader balanced
How do we know that the distribution of the leader replicas is balanced?  I 
didn't see anything in the scenario below that guarantees that, but maybe I'm 
missing something.


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@150
PS32, Line 150:   CreateWorkloadTable(kNumTablets, /*num_replicas*/ 3);
How do we know at this point that the cluster is leader-balanced?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@168
PS32, Line 168:   SleepFor(MonoDelta::FromSeconds(20));
That's a really long sleep time: that's not a good idea to have something like 
this in the test because it increases the run-time of the whole test suite.  Is 
it possible to use some custom setting for the parameters of the 
auto-rebalancing task to shorten this?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@169
PS32, Line 169: 50
I'm curious why is the number of retries so high?  Is it possible to make the 
rebalancer making more progress each iteration instead just in the sake of 
shortening run-time of this test scenario?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer-test.cc@175
PS32, Line 175:     SleepFor(MonoDelta::FromSeconds(2));
Why is it necessary to have a pause here?  Would be great to add a comment to 
clarify.


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h
File src/kudu/master/auto_leader_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h@45
PS32, Line 45: // A CatalogManager background task which auto-rebalances 
tablets' leaders distribution.
nit: remove this period


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h@46
PS32, Line 46: by the way leader transfer
by transferring leadership between tablet replicas


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.h@95
PS32, Line 95: std::shared_ptr<rpc::Messenger> messenger_;
nit: please document this member as well -- what is it for?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc
File src/kudu/master/auto_leader_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@55
PS32, Line 55: using std::map;
             : using std::nullopt;
             : using std::string;
             : using std::vector;
             :
             : using kudu::consensus::ConsensusServiceProxy;
             : using kudu::consensus::LeaderStepDownMode;
             : using kudu::consensus::LeaderStepDownRequestPB;
             : using kudu::consensus::LeaderStepDownResponsePB;
             : using kudu::consensus::RaftPeerPB;
             :
             : using kudu::rpc::MessengerBuilder;
             : using kudu::rpc::RpcController;
             : using strings::Substitute;
nit: remove empty lines and sort the rest of non-empty lines alphabetically


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@293
PS32, Line 293: vary
very?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@293
PS32, Line 293: synchronized rpc
What's "synchronized rpc"?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@299
PS32, Line 299: LOG(INFO)
I don't think this sort of information is worth it having in the INFO log: 
maybe, change this into VLOG(1) instead?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@304
PS32, Line 304: string
nit: could this be 'const string&' or 'const auto&' to avoid copying?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@330
PS32, Line 330: strings::
Drop the namespace prefix since there is corresponding 'using' using 
strings::Substitute' in the beginning of this file.


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/auto_leader_rebalancer.cc@337
PS32, Line 337: VLOG(0)
What's VLOG(0), and why not to use VLOG(1) for this?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@353
PS32, Line 353: auto-leader-rebalancing
automatic leader rebalancing


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1062
PS32, Line 1062: depend
depends


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1062
PS32, Line 1062: good replicas balance
Do you mean it relies on the even distribution of tablet replicas across all 
existing tablet servers, so every tablet server contains almost the same number 
of tablet replicas as any other tablet server?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1063
PS32, Line 1063: rebalance
rebalancing


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1063
PS32, Line 1063: auto_rebalancing
auto-rebalancing


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1064
PS32, Line 1064: load rebalance
What is 'load rebalance'?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1064
PS32, Line 1064: cann't
cannot


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1067
PS32, Line 1067: becaunse
because


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1067
PS32, Line 1067: follows
followers?


http://gerrit.cloudera.org:8080/#/c/18454/32/src/kudu/master/catalog_manager.cc@1067
PS32, Line 1067: propotion
?



--
To view, visit http://gerrit.cloudera.org:8080/18454
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibfb60d8759a93b6a19238637c27df4f6b1cac918
Gerrit-Change-Number: 18454
Gerrit-PatchSet: 32
Gerrit-Owner: Yuqi Du <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Yuqi Du <[email protected]>
Gerrit-Comment-Date: Sat, 01 Oct 2022 01:58:09 +0000
Gerrit-HasComments: Yes

Reply via email to