Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 13: (15 comments) Few comments after first initial look. http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@76 PS13, Line 76: // Assign num_locations unique locations to the first num_locations tservers. : for (int i = 0; i < num_locations; ++i) { : descs[i]->AssignLocationForTesting(Substitute("L$0", i)); : } What if num_locations turns to be greater than num_tservers? Most likely that would be a programming mistake, but maybe add an assertion to make sure it's properly handled if happens. >From the other side, why to pass 'num_tservers' as a parameter at all? Why >not to get that information from the 'cluster_' member: >cluster_->num_tablet_servers() ? http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@93 PS13, Line 93: moves_scheduled_this_round_for_test(leader_idx), 0 nit: in ASSERT_EQ(), the expected value comes first -- that way it's easier to read the error message if this assertion ever fails. http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@196 PS13, Line 196: skewed nit: no longer balanced http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@208 PS13, Line 208: SleepFor(MonoDelta::FromSeconds(1)); Here and below: is there a more reliable way of knowing the auto-rebalancer is up and running? Like wrapping the desired number of iterations returned by CatalogManager::auto_rebalancer_iterations() into ASSERT_EVENTUALLY() ? Without going into the details, my concern here is the inherent flakiness of tests based on timing assumptions like this one. We have witnessed these timings being far off due to various scheduler anomalies. http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@258 PS13, Line 258: locations.size(), kNumTservers nit here and elsewhere: the expected value comes first in {ASSERT,EXPECT}_EQ http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@260 PS13, Line 260: TestWorkload workload(cluster_.get()); : workload.set_num_tablets(kNumTablets); : workload.set_num_replicas(1); : workload.Setup(); This pattern repeats in a few places. Maybe, separate this into a method/function of the tablet number and the replication factor? http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer-test.cc@268 PS13, Line 268: If placement policy can never be satisfied, the cluster is balanced This statement is too bold. :) Replace with: If placement policy cannot be satisfied, the auto-rebalancer should not attempt to schedule any replica movements in vain. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@147 PS5, Line 147: > This value isn't actually used in this implementation of the auto-rebalance I think the best option here is to introduce a configurable parameter for the staleness interval. As of now it's totally fine to use a flag for that. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@491 PS5, Line 491: ask::BuildClusterRawInfo( > Made it a configurable gflag! Great, thank you! http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@131 PS13, Line 131: moves movement? http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@194 PS13, Line 194: will contain at most one move Maybe, add a DCHECK() for this pre-condition inside the loop below to spot issues, if any? http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@282 PS13, Line 282: CheckReplicaMovesCompleted What if the tablet has been dropped along with all its replicas? Will the auto-rebalancer thread stuck in this sub-cycle forever then? http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/auto_rebalancer.cc@276 PS13, Line 276: s = CheckReplicaMovesCompleted(&replica_moves, &all_moves_complete); : while (!all_moves_complete) { : if (!s.ok()) { : LOG(WARNING) << Substitute("could not perform replica move: $0", s.ToString()); : } : SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_wait_for_replica_moves_seconds)); : s = CheckReplicaMovesCompleted(&replica_moves, &all_moves_complete); : } nit: there is 'do {} while ();' construct which might be a better fit here code-wise http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/catalog_manager.cc@794 PS13, Line 794: auto_rebalancer_.reset(new AutoRebalancerTask(this, master_->ts_manager())); : RETURN_NOT_OK_PREPEND(auto_rebalancer_->Init(), : "failed to initialize auto-rebalancing task"); nit: it's safer to reset the auto_rebalancer_ member with properly initialized object because the code attempts to call methods like Shutdown() on exit. Consider replacing with: unique_ptr<AutoRebalancerTask> task(new ...); RETURN_NOT_OK_PREPEND(task->Init(), ...); auto_rebalancer_ = std::move(task); http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/ts_descriptor.h File src/kudu/master/ts_descriptor.h: http://gerrit.cloudera.org:8080/#/c/14177/13/src/kudu/master/ts_descriptor.h@188 PS13, Line 188: location_ = loc Since the parameter is passed by value, it seems this should have been location_ = std::move(loc) ? -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 13 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Mon, 03 Feb 2020 23:59:12 +0000 Gerrit-HasComments: Yes
