Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14177 )

Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 13:

(15 comments)

Few comments after first initial look.

http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@76
PS13, Line 76:     // Assign num_locations unique locations to the first 
num_locations tservers.
             :     for (int i = 0; i < num_locations; ++i) {
             :       descs[i]->AssignLocationForTesting(Substitute("L$0", i));
             :     }
What if num_locations turns to be greater than num_tservers?  Most likely that 
would be a programming mistake, but maybe add an assertion to make sure it's 
properly handled if happens.

>From the other side, why to pass 'num_tservers' as a parameter at all?  Why 
>not to get that information from the 'cluster_' member: 
>cluster_->num_tablet_servers() ?


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@93
PS13, Line 93: moves_scheduled_this_round_for_test(leader_idx), 0
nit: in ASSERT_EQ(), the expected value comes first -- that way it's easier to 
read the error message if this assertion ever fails.


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@196
PS13, Line 196: skewed
nit: no longer balanced


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@208
PS13, Line 208: SleepFor(MonoDelta::FromSeconds(1));
Here and below: is there a more reliable way of knowing the auto-rebalancer is 
up and running?  Like wrapping the desired number of iterations returned by 
CatalogManager::auto_rebalancer_iterations() into ASSERT_EVENTUALLY() ?

Without going into the details, my concern here is the inherent flakiness of 
tests based on timing assumptions like this one.  We have witnessed these 
timings being far off due to various scheduler anomalies.


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@258
PS13, Line 258: locations.size(), kNumTservers
nit here and elsewhere: the expected value comes first in {ASSERT,EXPECT}_EQ


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@260
PS13, Line 260:   TestWorkload workload(cluster_.get());
              :   workload.set_num_tablets(kNumTablets);
              :   workload.set_num_replicas(1);
              :   workload.Setup();
This pattern repeats in a few places.  Maybe, separate this into a 
method/function of the tablet number and the replication factor?


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@268
PS13, Line 268: If placement policy can never be satisfied, the cluster is 
balanced
This statement is too bold. :)

Replace with:

If placement policy cannot be satisfied, the auto-rebalancer should not attempt 
to schedule any replica movements in vain.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@147
PS5, Line 147:
> This value isn't actually used in this implementation of the auto-rebalance
I think the best option here is to introduce a configurable parameter for the 
staleness interval.  As of now it's totally fine to use a flag for that.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@491
PS5, Line 491: ask::BuildClusterRawInfo(
> Made it a configurable gflag!
Great, thank you!


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@131
PS13, Line 131: moves
movement?


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@194
PS13, Line 194: will contain at most one move
Maybe, add a DCHECK() for this pre-condition inside the loop below to spot 
issues, if any?


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@282
PS13, Line 282: CheckReplicaMovesCompleted
What if the tablet has been dropped along with all its replicas?  Will the 
auto-rebalancer thread stuck in this sub-cycle forever then?


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@276
PS13, Line 276:     s = CheckReplicaMovesCompleted(&replica_moves, 
&all_moves_complete);
              :     while (!all_moves_complete) {
              :       if (!s.ok()) {
              :         LOG(WARNING) << Substitute("could not perform replica 
move: $0", s.ToString());
              :       }
              :       
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_wait_for_replica_moves_seconds));
              :       s = CheckReplicaMovesCompleted(&replica_moves, 
&all_moves_complete);
              :     }
nit: there is 'do {} while ();' construct which might be a better fit here 
code-wise


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/catalog_manager.cc@794
PS13, Line 794:     auto_rebalancer_.reset(new AutoRebalancerTask(this, 
master_->ts_manager()));
              :     RETURN_NOT_OK_PREPEND(auto_rebalancer_->Init(),
              :                           "failed to initialize 
auto-rebalancing task");
nit: it's safer to reset the auto_rebalancer_ member with properly initialized 
object because the code attempts to call methods like Shutdown() on exit.  
Consider replacing with:

  unique_ptr<AutoRebalancerTask> task(new ...);
  RETURN_NOT_OK_PREPEND(task->Init(), ...);
  auto_rebalancer_ = std::move(task);


http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/ts_descriptor.h
File src/kudu/master/ts_descriptor.h:

http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/ts_descriptor.h@188
PS13, Line 188: location_ = loc
Since the parameter is passed by value, it seems this should have been

  location_ = std::move(loc)

?



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 13
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 03 Feb 2020 23:59:12 +0000
Gerrit-HasComments: Yes

Reply via email to