Hannah Nguyen has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 18: (35 comments) http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@146 PS18, Line 146: // Make sure the leader master has begun the auto-rebalancing thread. : void CheckAutoRebalancerStarted() { : ASSERT_EVENTUALLY([&] { : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_LT(0, NumLoopIterations(leader_idx)); : }); : } : : // Make sure the auto-rebalancing loop has iterated a few times, : // and no moves were scheduled. : void CheckNoMovesScheduled() { : ASSERT_EVENTUALLY([&] { : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_LT(3, NumLoopIterations(leader_idx)); : ASSERT_EQ(0, NumMovesScheduled(leader_idx)); : }); : } : : void CheckSomeMovesScheduled() { : ASSERT_EVENTUALLY([&] { : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_LT(1, NumLoopIterations(leader_idx)); : ASSERT_LT(0, NumMovesScheduled(leader_idx)); : }); : } > It isn't important that these calls aren't repeatable is it? At least I don It doesn't matter with the way these are used in the current set of tests are written, but I'll update it to the above in case they're used in future tests and need to be called multiple times in one test. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@178 PS18, Line 178: > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@184 PS18, Line 184: > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@253 PS18, Line 253: ASSERT_EQ(0, NumLoopIterations(i)); : } > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@269 PS18, Line 269: // Create a cluster that is initially balanced. : // Bring up another tserver, and verify that moves are scheduled, : // since the cluster is no longer balanced. : TEST_F(AutoRebalancerTest, MovesScheduledIfAddTserver) { > Do you think it's worth testing that recovering replicas doesn't count towa If there's a discrepancy between the number of tservers the TSManager believes are available and the number of tservers the CatalogManager believes are up with tables on them, then to be safe, the auto-rebalancer just exits the current iteration and sleeps (exiting BuildClusterRawInfo()), and waits to see if next cycle it will be consistent. If the number of tservers is already consistent, but there are some tablets that are still under-replicated, then those tablets aren't considered, according to the rebalancing logic that's already present (and been tested) in BuildTabletsPlacementInfo(), in rebalance/placement_policy_util. Do you think it's worth adding a test to verify the former situation? Or any other scenario I might be missing? http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@424 PS18, Line 424: Inject latency. > nit: Inject latency to make tablet's Raft configuration unstable due to fre Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@429 PS18, Line 429: the latency. > nit: frequent leadership changes. Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@66 PS18, Line 66: // > nit: remove the extra line Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@83 PS18, Line 83: friend class CatalogManager; > nit: Is this still needed? Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@157 PS18, Line 157: nor 'completion_status' > That doesn't exist anymore. Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@159 PS18, Line 159: 'completion_status' is set to the : // corresponding move status: Status::OK() if the move succeeded, or : // non-OK status if it failed. > ditto Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@163 PS18, Line 163: and 'completion_status' > ditto Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@164 PS18, Line 164: contain valid information. > This isn't the case with the new implementation: now with even non-OK retur Clarified the function description! http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@62 PS18, Line 62: using kudu::cluster_summary::ConsensusConfigType; > warning: using decl 'ConsensusConfigType' is unused [misc-unused-using-decl Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@63 PS18, Line 63: using kudu::cluster_summary::ConsensusState; > warning: using decl 'ConsensusState' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@97 PS18, Line 97: using std::lock_guard; > warning: using decl 'lock_guard' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120 PS18, Line 120: RPC timeout in seconds > nit: do you think it's worth elaborating on what happens when RPCs time out Added a little more detail to the description. Should I add the following information as well? If an RPC times out, like other errors (such as being unable to gather cluster information), the current cycle ends, the thread sleeps, and tries again next cycle. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120 PS18, Line 120: int64 > Would int32/uint32 suffice here? Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120 PS18, Line 120: auto_rebalancing_timeout_seconds > This naming looks a bit confusing because it's too generic. Maybe, rename Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@168 PS18, Line 168: number_of_loop_iterations_for_test_ = 0; : moves_scheduled_this_round_for_test_ = 0; > nit: put these in the member initializer list? Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@201 PS18, Line 201: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); > Is it safe to sleep here because we both don't have the lock and aren't the Will add a comment. This is also tested in NextLeaderResumesAutoRebalancing. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@249 PS18, Line 249: : s = ExecuteMoves(replica_moves, has_location_violations); : WARN_NOT_OK(s, Substitute("could not schedule auto-rebalancing replica moves: $0", : s.ToString())); > nit: don't need to save the local variable; WARN_NOT_OK prints the Status. Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@324 PS18, Line 324: location_raw_info, &algo, CrossLocations::NO, &rep_moves)); > nit: add a couple spaces Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@339 PS18, Line 339: max_moves -= replica_moves->size(); > We should probably exit early here if this is non-positive, lest we burn a Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@374 PS18, Line 374: replica_moves->swap(rep_moves); > nit: using std::move() is preferred in such cases: Done. Updated same above as well. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@396 PS18, Line 396: leader > nit: this is only used once; maybe inline it? Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@439 PS18, Line 439: if (!ts_manager_->LookupTSByUUID(leader_uuid, &desc)) { : return Status::NotFound("Could not find leader replica's tserver"); : } > Would it be appropriate to also determine whether dst_ts_uuid is available Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@447 PS18, Line 447: dst_ts_uuid > nit: maybe DCHECK this isn't empty? Or predicate on dst_ts_uuid being empty Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@451 PS18, Line 451: HostPortToPB > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@452 PS18, Line 452: else { : RETURN_NOT_OK(hp.ParseString(leader_uuid, leader_hp.port())); : vector<Sockaddr> resolved; : RETURN_NOT_OK(hp.ResolveAddresses(&resolved)); : proxy.reset(new ConsensusServiceProxy(messenger_, resolved[0], hp.host())); : } : > We get the leader proxy unconditionally, so maybe pull it out of the condit Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@669 PS18, Line 669: if (!s.ok()) { : replica_moves->erase(replica_moves->begin() + i); : LOG(WARNING) << Substitute("Could not move replica: $0", s.ToString()); : return s; : } : : // If move was completed, remove it from 'replica_moves'. : if (move_is_complete) { : indexes_to_remove.push_back(i); : } > I think this code doesn't bode well anymore with new semantics of the Check I have updated CheckMoveCompleted() to only set move_is_complete true if it will return an OK status, so there shouldn't be a situation where a move returns non-OK and is complete, but gets stuck because it isn't removed. All non-OK moves will be removed in L670, and CheckReplicaMovesCompleted() will return. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@689 PS18, Line 689: unknown > Why 'unknown'? I thought that was you who added this TODO, no? Whoops! It's me. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@706 PS18, Line 706: shared_ptr<ConsensusServiceProxy> proxy; > nit: move closer to the point of usage, i.e. right before linke 711 Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc@313 PS18, Line 313: runtime > This probably shouldn't be labeled "runtime" given the initialization of th Done http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc File src/kudu/rebalance/rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc@528 PS18, Line 528: move_tablet_id > nit: can std::move(move_tablet_id) Done -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 18 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Mon, 09 Mar 2020 16:34:43 +0000 Gerrit-HasComments: Yes
