Hannah Nguyen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14177 )

Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 18:

(35 comments)

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@146
PS18, Line 146:   // Make sure the leader master has begun the auto-rebalancing 
thread.
              :   void CheckAutoRebalancerStarted() {
              :     ASSERT_EVENTUALLY([&] {
              :       int leader_idx;
              :       ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
              :       ASSERT_LT(0, NumLoopIterations(leader_idx));
              :     });
              :   }
              :
              :   // Make sure the auto-rebalancing loop has iterated a few 
times,
              :   // and no moves were scheduled.
              :   void CheckNoMovesScheduled() {
              :     ASSERT_EVENTUALLY([&] {
              :       int leader_idx;
              :       ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
              :       ASSERT_LT(3, NumLoopIterations(leader_idx));
              :       ASSERT_EQ(0, NumMovesScheduled(leader_idx));
              :     });
              :   }
              :
              :   void CheckSomeMovesScheduled() {
              :     ASSERT_EVENTUALLY([&] {
              :       int leader_idx;
              :       ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
              :       ASSERT_LT(1, NumLoopIterations(leader_idx));
              :       ASSERT_LT(0, NumMovesScheduled(leader_idx));
              :     });
              :   }
> It isn't important that these calls aren't repeatable is it? At least I don
It doesn't matter with the way these are used in the current set of tests are 
written, but I'll update it to the above in case they're used in future tests 
and need to be called multiple times in one test.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@178
PS18, Line 178:  
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@184
PS18, Line 184:      
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@253
PS18, Line 253:        ASSERT_EQ(0, NumLoopIterations(i));
              :       }
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@269
PS18, Line 269: // Create a cluster that is initially balanced.
              : // Bring up another tserver, and verify that moves are 
scheduled,
              : // since the cluster is no longer balanced.
              : TEST_F(AutoRebalancerTest, MovesScheduledIfAddTserver) {
> Do you think it's worth testing that recovering replicas doesn't count towa
If there's a discrepancy between the number of tservers the TSManager believes 
are available and the number of tservers the CatalogManager believes are up 
with tables on them, then to be safe, the auto-rebalancer just exits the 
current iteration and sleeps (exiting BuildClusterRawInfo()), and waits to see 
if next cycle it will be consistent.
If the number of tservers is already consistent, but there are some tablets 
that are still under-replicated, then those tablets aren't considered, 
according to the rebalancing logic that's already present (and been tested) in 
BuildTabletsPlacementInfo(), in rebalance/placement_policy_util.

Do you think it's worth adding a test to verify the former situation? Or any 
other scenario I might be missing?


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@424
PS18, Line 424: Inject latency.
> nit: Inject latency to make tablet's Raft configuration unstable due to fre
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@429
PS18, Line 429: the latency.
> nit: frequent leadership changes.
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h
File src/kudu/master/auto_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@66
PS18, Line 66: //
> nit: remove the extra line
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@83
PS18, Line 83:   friend class CatalogManager;
> nit: Is this still needed?
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@157
PS18, Line 157: nor 'completion_status'
> That doesn't exist anymore.
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@159
PS18, Line 159: 'completion_status' is set to the
              :   // corresponding move status: Status::OK() if the move 
succeeded, or
              :   // non-OK status if it failed.
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@163
PS18, Line 163: and 'completion_status'
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@164
PS18, Line 164: contain valid information.
> This isn't the case with the new implementation: now with even non-OK retur
Clarified the function description!


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@62
PS18, Line 62: using kudu::cluster_summary::ConsensusConfigType;
> warning: using decl 'ConsensusConfigType' is unused [misc-unused-using-decl
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@63
PS18, Line 63: using kudu::cluster_summary::ConsensusState;
> warning: using decl 'ConsensusState' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@97
PS18, Line 97: using std::lock_guard;
> warning: using decl 'lock_guard' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120
PS18, Line 120: RPC timeout in seconds
> nit: do you think it's worth elaborating on what happens when RPCs time out
Added a little more detail to the description. Should I add the following 
information as well?

If an RPC times out, like other errors (such as being unable to gather cluster 
information), the current cycle ends, the thread sleeps, and tries again next 
cycle.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120
PS18, Line 120: int64
> Would int32/uint32 suffice here?
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120
PS18, Line 120: auto_rebalancing_timeout_seconds
> This naming looks a bit confusing because it's too generic.  Maybe, rename
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@168
PS18, Line 168:   number_of_loop_iterations_for_test_ = 0;
              :   moves_scheduled_this_round_for_test_ = 0;
> nit: put these in the member initializer list?
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@201
PS18, Line 201:         
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds));
> Is it safe to sleep here because we both don't have the lock and aren't the
Will add a comment.
This is also tested in NextLeaderResumesAutoRebalancing.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@249
PS18, Line 249:
              :     s = ExecuteMoves(replica_moves, has_location_violations);
              :     WARN_NOT_OK(s, Substitute("could not schedule 
auto-rebalancing replica moves: $0",
              :                               s.ToString()));
> nit: don't need to save the local variable; WARN_NOT_OK prints the Status.
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@324
PS18, Line 324:         location_raw_info, &algo, CrossLocations::NO, 
&rep_moves));
> nit: add a couple spaces
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@339
PS18, Line 339:   max_moves -= replica_moves->size();
> We should probably exit early here if this is non-positive, lest we burn a
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@374
PS18, Line 374: replica_moves->swap(rep_moves);
> nit: using std::move() is preferred in such cases:
Done. Updated same above as well.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@396
PS18, Line 396: leader
> nit: this is only used once; maybe inline it?
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@439
PS18, Line 439:       if (!ts_manager_->LookupTSByUUID(leader_uuid, &desc)) {
              :         return Status::NotFound("Could not find leader 
replica's tserver");
              :       }
> Would it be appropriate to also determine whether dst_ts_uuid is available
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@447
PS18, Line 447: dst_ts_uuid
> nit: maybe DCHECK this isn't empty? Or predicate on dst_ts_uuid being empty
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@451
PS18, Line 451: HostPortToPB
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@452
PS18, Line 452:  else {
              :       RETURN_NOT_OK(hp.ParseString(leader_uuid, 
leader_hp.port()));
              :       vector<Sockaddr> resolved;
              :       RETURN_NOT_OK(hp.ResolveAddresses(&resolved));
              :       proxy.reset(new ConsensusServiceProxy(messenger_, 
resolved[0], hp.host()));
              :     }
              :
> We get the leader proxy unconditionally, so maybe pull it out of the condit
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@669
PS18, Line 669:     if (!s.ok()) {
              :       replica_moves->erase(replica_moves->begin() + i);
              :       LOG(WARNING) << Substitute("Could not move replica: $0", 
s.ToString());
              :       return s;
              :     }
              :
              :     // If move was completed, remove it from 'replica_moves'.
              :     if (move_is_complete) {
              :       indexes_to_remove.push_back(i);
              :     }
> I think this code doesn't bode well anymore with new semantics of the Check
I have updated CheckMoveCompleted() to only set move_is_complete true if it 
will return an OK status, so there shouldn't be a situation where a move 
returns non-OK and is complete, but gets stuck because it isn't removed.

All non-OK moves will be removed in L670, and CheckReplicaMovesCompleted() will 
return.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@689
PS18, Line 689: unknown
> Why 'unknown'?  I thought that was you who added this TODO, no?
Whoops! It's me.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@706
PS18, Line 706: shared_ptr<ConsensusServiceProxy> proxy;
> nit: move closer to the point of usage, i.e. right before linke 711
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc@313
PS18, Line 313: runtime
> This probably shouldn't be labeled "runtime" given the initialization of th
Done


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc
File src/kudu/rebalance/rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc@528
PS18, Line 528: move_tablet_id
> nit: can std::move(move_tablet_id)
Done



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 18
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 09 Mar 2020 16:34:43 +0000
Gerrit-HasComments: Yes

Reply via email to