Hannah Nguyen has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: WIP KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 5: (64 comments) http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@50 PS5, Line 50: FLAGS_auto_rebalancing_stop_flag > You probably meant to set the interval here. Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@87 PS5, Line 87: : if (!AllowSlowTests()) { : LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 to run"; : return; : } > nit: you can use SKIP_IF_SLOW_NOT_ALLOWED() for this. Same elsewhere. Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@111 PS5, Line 111: } : else { > Join the lines for better readability. Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@104 PS5, Line 104: SleepFor(MonoDelta::FromSeconds(10)); : : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : for (int i=0; i<kNumMasters; i++) { : if (i==leader_idx) { : ASSERT_NE(0, number_of_loop_iterations(i)); : } : else { : ASSERT_EQ(0, moves_scheduled_this_round(i)); : } : } : > Rather than sleeping this long, how about asserting with ASSERT_EVENTUALLY Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@117 PS5, Line 117: cluster_->Shutdown(); > nit: don't need this since it gets called in TearDown() Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@144 PS5, Line 144: auto iterations = number_of_loop_iterations(new_leader_idx); : ASSERT_LT(0, iterations); > Should this be wrapped in ASSERT_EVENTUALLY? What if the thread hasn't star Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@149 PS5, Line 149: TEST_F(AutoRebalancerTest, NoReplicaMovesIfBalanced) { > Consider breaking this into smaller tests so the cases are parallelizable. Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@200 PS5, Line 200: num_tservers = 2; : num_tablets = 1; > It doesn't seem particularly useful to keep defining these separately, and Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@224 PS5, Line 224: SleepFor(MonoDelta::FromSeconds(10)); : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_EQ(moves_scheduled_this_round(leader_idx), 0); > I wonder if it's sufficient to just define "no moves" as having the loop it Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@263 PS5, Line 263: ASSERT_GE > ASSERT_EQ()? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@63 PS5, Line 63: // the catalog manager must be in the process of initializing. > nit: Why does this need to be the case? The only invariant I see right now Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@71 PS5, Line 71: int number_of_loop_iterations; : int moves_scheduled_this_round; > Nit: could you suffix these with _for_test so it's obvious that their acces Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@70 PS5, Line 70: // Variables for testing. : int number_of_loop_iterations; : int moves_scheduled_this_round; > Is it possible to make this variables private and declare the test that acc Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@85 PS5, Line 85: const boost::optional<std::string>& location > Please document the semantics of the 'location' parameter -- it's not self Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@88 PS5, Line 88: Status GetMovesUsingRebalancingAlgo( : const rebalance::ClusterRawInfo& raw_info, : std::vector<rebalance::Rebalancer::ReplicaMove>* replica_moves, : rebalance::RebalancingAlgo* algo, : bool cross_location = false) const; : : Status GetMoves(std::vector<rebalance::Rebalancer::ReplicaMove>* replica_moves, : bool* policy_fixing) const; : : Status GetTabletLeader( : const std::string& tablet_id, : std::string* leader_uuid, : HostPort* leader_hp) const; : : Status ExecuteMoves( : const std::vector<rebalance::Rebalancer::ReplicaMove>& replica_moves, : const bool& policy_fixing); > nit: add docs. What do these and their arguments do? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@122 PS5, Line 122: rebalancer_ > Does this _need_ to be heap allocated? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@133 PS5, Line 133: > nit: an extra line Done http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc@377 PS3, Line 377: unordered_set<string> tablets_in_move; : for (const auto& move : moves) { : vector<string> tablet_ids; : RETURN_NOT_OK(rebalancer_->FindReplicas(move, raw_info, &tablet_ids)); : if (cross_location) { : // In case of cross-location (a.k.a. inter-location) rebalancing it is : // necessary to make sure the majority of replicas would not end up : // at the same location after the move. If so, remove those tablets : // from the list of candidates. : RETURN_NOT_OK(rebalancer_->FilterCrossLocationTabletCandidates( : cluster_info.locality.location_by_ts_id, tpi, move, &tablet_ids)); : } : // Shuffle the set of the tablet identifiers: that's to achieve even spread : // of moves across tables with the same skew. : std::shuffle(tablet_ids.begin(), tablet_ids.end(), random_device()); : string move_tablet_id; : for (const auto& tablet_id : tablet_ids) { : if (tablets_in_move.find(tablet_id) == tablets_in_move.end()) { : // For now, choose the very first tablet that does not have replicas : // in move. Later on, additional logic might be added to find : // the best candidate. : move_tablet_id = tablet_id; : break; : } : } > As a reviewer, my main gripe with this is that in reviewing this patch, it' Done http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc@637 PS3, Line 637: !location) { > nit: this is much less readable than specifying the actual type Done http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc@570 PS4, Line 570: ts_manager->GetDescriptorsAvailableForPlacement(&descriptors); > error: no member named 'GetAllLiveDescriptors' in 'kudu::master::TSManager' Done http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc@693 PS4, Line 693: bool table_skew = max_table_skew-min_table_skew > 1; > warning: redundant boolean literal in conditional return statement [readabi Done http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc@711 PS4, Line 711: return !IsClusterOrTablesSkewed(cluster_info_); > warning: redundant boolean literal in conditional return statement [readabi Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@85 PS5, Line 85: using kudu::rpc::Messenger; > warning: using decl 'Messenger' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@101 PS5, Line 101: auto_rebalancing_stop_flag > Yep, that sounds good to me. And if I remember correctly, that was one the Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@139 PS5, Line 139: > nit: should be 4 spaces for this sort of 'continuation indent' Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@147 PS5, Line 147: 300 > Do we want to make this parameter configurable or the hard-coded value is g This value isn't actually used in this implementation of the auto-rebalancer, just in RebalancerTool::RunWith(). I filled in the default value of the flag here to build the Config struct. What's the best practice here for this field? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@177 PS5, Line 177: DCHECK > What's the reason of using DCHECK() here but using CHECK() for thread_ vali Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@196 PS5, Line 196: CatalogManager::ScopedLeaderSharedLock l(catalog_manager_); > Taking a lock and then sleeping at line 200 doesn't look good. Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@200 PS5, Line 200: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); > Should we just sleep at the top of the loop so we don't have to repeat this Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@209 PS5, Line 209: cluster_info_ > Nit: since we're blowing these away every iteration, how about just making Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@212 PS5, Line 212: Status s; : s = BuildClusterRawInfo(boost::none, &raw_info_); > nit: why not just Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@215 PS5, Line 215: s.ToString() << ": could not retrieve cluster info"; > nit: consider using Substitute() from gutil/strings/substitute.h here, e.g. Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@221 PS5, Line 221: (s.ok()) > nit: drop the extra parens; same elsewhere Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@238 PS5, Line 238: this->I > nit: drop this->? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@245 PS5, Line 245: if (!FLAGS_auto_rebalancing_stop_flag) { > We should probably move this check to the top of the loop so we don't go th Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@274 PS5, Line 274: bool* policy_fixing > nit: it wasn't immediately obvious to me from "policy fixing" that this is Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@286 PS5, Line 286: replica_moves > Nit: space before Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@334 PS5, Line 334: vector<Rebalancer::ReplicaMove>* replica_moves, : rebalance::RebalancingAlgo* algo, : bool cross_location > nit: mind reordering these so the pure out param (replica_moves) comes afte Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@336 PS5, Line 336: bool cross_location > nit: for readability, consider using an enum like CrossLocations::{YES, NO} Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@433 PS5, Line 433: > nit: extra line Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@460 PS5, Line 460: > nit: extra line Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@462 PS5, Line 462: if (policy_fixing) > What is the actual difference between policy_fixing clause and the !policy_ Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@474 PS5, Line 474: unique_ptr<ConsensusServiceProxy> proxy; > Does this have to be heap-allocated? Same elsewhere. making it a shared_ptr to be able to reuse the TSdescriptor's GetConsensusProxy() http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@491 PS5, Line 491: MonoDelta::FromSeconds(60) > Where does this timeout comes from? Should it be configurable? Made it a configurable gflag! http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@500 PS5, Line 500: > nit: extra line Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@513 PS5, Line 513: HostPort hp; : RETURN_NOT_OK(hp.ParseString(leader_uuid, leader_hp.port())); : vector<Sockaddr> resolved; : RETURN_NOT_OK(hp.ResolveAddresses(&resolved)); : proxy.reset(new ConsensusServiceProxy(messenger_, resolved[0], hp.host())); > At lines 528-533 TSDescriptor for dst_ts_uuid is fetched. TSDescriptor has Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@564 PS5, Line 564: vector<ServerHealthSummary> tserver_summaries; : vector<TableSummary> table_summaries; : vector<TabletSummary> tablet_summaries; > Since the code below uses emplace_back()/push_back() to fill this vectors, Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@577 PS5, Line 577: (ts->last_address()) > Are the outer parentheses really necessary? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@577 PS5, Line 577: summary.ts_location = (ts->last_address()).ToString(); > How come address (i.e. HostPort) became a location for a tablet server? Th Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@594 PS5, Line 594: SysTablesEntryPB table_data = table->metadata().state().pb; > Is it possible to get a constant reference to the SysTablesEntryPB here? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@601 PS5, Line 601: > What about TableMetadataLock when retrieving information on all its tablets Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@615 PS5, Line 615: ReplicaTypeFilter::ANY_REPLICA > Are we really interested in non-voters as well while rebalancing? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@622 PS5, Line 622: TSInfoPB ts_info > Why not to use a reference here? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@626 PS5, Line 626: rep.ts_address = Substitute("$0:$1", addr.host(), addr.port()); > What about filling in 'is_leader' field for ReplicaSummary? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@627 PS5, Line 627: replicas.push_back(rep); > What about filling in the 'consensus_state' field for ReplicaSummary? As I Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@641 PS5, Line 641: raw_info->table_summaries = std::move(table_summaries); > nit: for readability, add return Status::OK() and move the code out of the Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@680 PS5, Line 680: namespace { > nit: add a line before 'namespace' Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@681 PS5, Line 681: Is > nit: sounds a bit awkward. Maybe remove? or "AreCluster.."? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@682 PS5, Line 682: > nit: extra line Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@687 PS5, Line 687: max_replica_count-min_replica_count > nit: add spaces, same below Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@691 PS5, Line 691: const auto min_table_skew = table_skew_info.begin()->first; : const auto m > const auto&? Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@709 PS5, Line 709: // One location: table and cluster skew? > nit: complete sentences in comments are preferred. Same elsewhere. Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc@5202 PS5, Line 5202: : TSManager* CatalogManager::ts_manager() const { : return master_->ts_manager(); : } : > nit: Rather than exposing this from the catalog manager, how about passing Done http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h File src/kudu/rebalance/rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h@147 PS5, Line 147: // Public accessors for fields of Rebalancer's config object. : bool ShouldRunPolicyFixer() const; : bool ShouldRunCrossLocationRebalancing() const; : bool ShouldRunIntraLocationRebalancing() const; > Don't need these anymore Done -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 5 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Wed, 02 Oct 2019 02:10:25 +0000 Gerrit-HasComments: Yes
