Hannah Nguyen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14177 )

Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 11:

(27 comments)

http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc@165
PS11, Line 165: // Each tserver is considered its own location.
> I'm not sure I understand what this means.  As far as I know, in case of no
Ah, misunderstood the testing setup. Will add in explicit location assignment 
for tablet servers--thank you for the code pointer!


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc@180
PS11, Line 180:  Each tserver is its own location
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.h
File src/kudu/master/auto_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.h@165
PS11, Line 165:   // Structs to hold cluster information.
              :   // rebalance::ClusterRawInfo raw_info_;
              :   // rebalance::ClusterInfo cluster_info_;
              :   // rebalance::TabletsPlacementInfo placement_info_;
> Remove if no longer needed?
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@146
PS11, Line 146: 300
> Is this used by the rebalancer thread?
No, it's not. This is the default value assigned in the rebalancer tool. Should 
I set it to something else or specify that it's unused?


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@224
PS11, Line 224: 
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds));
> Would this be a double-pause given there is another sleep at line 195?
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@228
PS11, Line 228: Rebalancer::MovesInProgress()
> Why the moves in progress are empty?  I think this cannot be empty: the inf
Ah, I initially intended for the replica movement to be synchronous (ie the 
loop iteration would wait for the moves to complete) but as implemented, will 
need to keep track of moves between iterations and do something similar to (or 
refactor and utilize) Rebalancer::FilterMoves() and variants of 
UpdateMovesInProgressStatus() in RebalancerTool.


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@231
PS11, Line 231: 
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds));
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@236
PS11, Line 236: Rebalancer::MovesInProgress()
> Ditto: why is it empty?  What if previous run has scheduled some moves?  Ma
Responded above.


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@239
PS11, Line 239: 
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@246
PS11, Line 246: IsClusterBalanced
> Why this method/function is necessary at all?  Wouldn't GetMoves() return a
Yes, GetMoves() will return an empty set if the cluster is balanced.
But I thought that it's also possible that GetMoves() could return an empty set 
of moves if there are errors in retrieving information about the cluster at 
some point, eg calls to BuildClusterRawInfo() for each location in 
intra-location rebalancing. In this case, there shouldn't be any moves 
scheduled and the task sleeps, but the cluster isn't necessarily balanced, so 
the round isn't complete.

I've used the result of IsClusterBalanced() to differentiate between completed 
rounds of rebalancing and resetting moves_scheduled_this_round_for_test_ to 0. 
Is using it for testing enough to justify keeping the function?


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@249
PS11, Line 249:         lock_guard<simple_spinlock> l(lock_for_test_);
              :         moves_scheduled_this_round_for_test_ = 0;
> nit: might be worth just using an atomic<int> or something
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@252
PS11, Line 252:       
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds));
> Don't need this anymore, right? We'll sleep up top? Same below.
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@298
PS11, Line 298:   // One location: use greedy rebalancing algorithm to find 
moves.
              :   if (ts_id_by_location.size() == 1) {
              :     rebalance::TwoDimensionalGreedyAlgo algo;
              :     RETURN_NOT_OK(GetMovesUsingRebalancingAlgo(raw_info, &algo, 
CrossLocations::NO, replica_moves));
              :     *has_location_violations = false;
              :   } else {
              :
> nit: could you write this avoiding the extra scope layering? It's easier on
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@317
PS11, Line 317:     // If no placement policy violations are found, do load 
rebalancing.
              :     if (replica_moves->empty()) {
> nit: could you write this avoiding the extra scope layering? It's easier on
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@358
PS11, Line 358: ion==Cro
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@375
PS11, Line 375: on==CrossLoca
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@384
PS11, Line 384:     std::random_device rd;
              :     std::mt19937 random_generator(rd());
> How about we have a single random_generator associated with the task and re
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@405
PS11, Line 405: ANY_REPLICA
> Why is this distinction over VOTER_REPLICA important?
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@498
PS11, Line 498:   
ts_manager_->GetDescriptorsAvailableForPlacement(&descriptors);
> What happens when not all of the tablet servers have registered with the ma
Changed the code according to Alexey's suggestion below, to return an error 
from this function if there are any tablet servers that aren't available for 
placement, and have the autorebalancer task sleep, but IIUC these unavailable 
tservers are still registered tservers.

Right now the autorebalancer only has descriptors of tservers that have 
registered with the TSManager. Is there a way to get information on 
un-registered tservers that have replicas?


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@501
PS11, Line 501:   // Since only returning live descriptors, all tservers 
healthy.
> I think this logic has a flaw: the descriptors available for placement are
I've changed the behavior to match the default behavior of the rebalancer tool; 
if any of the tservers are unavailable for placement, 
AutorebalancerTask::BuildClusterRawInfo() will return an error, and the task 
will sleep.


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@509
PS11, Line 509: ts->last_address().ToString();
> I mentioned it before and I'm re-iterating this again: tablet server addres
Oops, must have missed taking this out the last round of review.
If cluster isn't location-aware, no need to set summary.ts_location, since 
rebalancing/ksck code always checks to see if it's empty.


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@517
PS11, Line 517:     CatalogManager::ScopedLeaderSharedLock l(catalog_manager_);
> Check the status here.
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@550
PS11, Line 550: l
> nit: could you use a different variable name so this doesn't get conflated
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@572
PS11, Line 572:             ConsensusConfigType::COMMITTED,
              :             cstatepb.current_term(),
              :             cstatepb.committed_config().opid_index(),
              :             cstatepb.leader_uuid(),
              :             voters,
              :             non_voters);
> nit: too many spaces
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@739
PS11, Line 739:   sort(locations.begin(), locations.end());
> Why is this sorting important?
Oops, this was a remnant of code that is responsible for printing the 
per-location balance information. Will remove.


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@748
PS11, Line 748:
> nit: extra space
Done


http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/rebalance/rebalancer.h
File src/kudu/rebalance/rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/rebalance/rebalancer.h@160
PS11, Line 160:   // Filter move operations in 'replica_moves': remove all 
operations that would
              :   // involve moving replicas of tablets which are in 
'scheduled_moves'. The
              :   // 'replica_moves' cannot be null.
              :   static void FilterMoves(const MovesInProgress& 
scheduled_moves,
              :                           std::vector<ReplicaMove>* 
replica_moves);
> Did this need to be made public? It isn't used by auto_rebalancer or rebala
It's used in RebalancerTool::BaseRunner::GetNextMoves().
I'll bump it down to protected, since RebalancerTool inherits from Rebalancer.



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 11
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 16 Oct 2019 04:28:57 +0000
Gerrit-HasComments: Yes

Reply via email to