Hannah Nguyen has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 11: (27 comments) http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc@165 PS11, Line 165: // Each tserver is considered its own location. > I'm not sure I understand what this means. As far as I know, in case of no Ah, misunderstood the testing setup. Will add in explicit location assignment for tablet servers--thank you for the code pointer! http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc@180 PS11, Line 180: Each tserver is its own location > ditto Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.h@165 PS11, Line 165: // Structs to hold cluster information. : // rebalance::ClusterRawInfo raw_info_; : // rebalance::ClusterInfo cluster_info_; : // rebalance::TabletsPlacementInfo placement_info_; > Remove if no longer needed? Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@146 PS11, Line 146: 300 > Is this used by the rebalancer thread? No, it's not. This is the default value assigned in the rebalancer tool. Should I set it to something else or specify that it's unused? http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@224 PS11, Line 224: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); > Would this be a double-pause given there is another sleep at line 195? Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@228 PS11, Line 228: Rebalancer::MovesInProgress() > Why the moves in progress are empty? I think this cannot be empty: the inf Ah, I initially intended for the replica movement to be synchronous (ie the loop iteration would wait for the moves to complete) but as implemented, will need to keep track of moves between iterations and do something similar to (or refactor and utilize) Rebalancer::FilterMoves() and variants of UpdateMovesInProgressStatus() in RebalancerTool. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@231 PS11, Line 231: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); > ditto Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@236 PS11, Line 236: Rebalancer::MovesInProgress() > Ditto: why is it empty? What if previous run has scheduled some moves? Ma Responded above. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@239 PS11, Line 239: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds) > ditto Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@246 PS11, Line 246: IsClusterBalanced > Why this method/function is necessary at all? Wouldn't GetMoves() return a Yes, GetMoves() will return an empty set if the cluster is balanced. But I thought that it's also possible that GetMoves() could return an empty set of moves if there are errors in retrieving information about the cluster at some point, eg calls to BuildClusterRawInfo() for each location in intra-location rebalancing. In this case, there shouldn't be any moves scheduled and the task sleeps, but the cluster isn't necessarily balanced, so the round isn't complete. I've used the result of IsClusterBalanced() to differentiate between completed rounds of rebalancing and resetting moves_scheduled_this_round_for_test_ to 0. Is using it for testing enough to justify keeping the function? http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@249 PS11, Line 249: lock_guard<simple_spinlock> l(lock_for_test_); : moves_scheduled_this_round_for_test_ = 0; > nit: might be worth just using an atomic<int> or something Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@252 PS11, Line 252: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); > Don't need this anymore, right? We'll sleep up top? Same below. Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@298 PS11, Line 298: // One location: use greedy rebalancing algorithm to find moves. : if (ts_id_by_location.size() == 1) { : rebalance::TwoDimensionalGreedyAlgo algo; : RETURN_NOT_OK(GetMovesUsingRebalancingAlgo(raw_info, &algo, CrossLocations::NO, replica_moves)); : *has_location_violations = false; : } else { : > nit: could you write this avoiding the extra scope layering? It's easier on Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@317 PS11, Line 317: // If no placement policy violations are found, do load rebalancing. : if (replica_moves->empty()) { > nit: could you write this avoiding the extra scope layering? It's easier on Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@358 PS11, Line 358: ion==Cro > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@375 PS11, Line 375: on==CrossLoca > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@384 PS11, Line 384: std::random_device rd; : std::mt19937 random_generator(rd()); > How about we have a single random_generator associated with the task and re Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@405 PS11, Line 405: ANY_REPLICA > Why is this distinction over VOTER_REPLICA important? Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@498 PS11, Line 498: ts_manager_->GetDescriptorsAvailableForPlacement(&descriptors); > What happens when not all of the tablet servers have registered with the ma Changed the code according to Alexey's suggestion below, to return an error from this function if there are any tablet servers that aren't available for placement, and have the autorebalancer task sleep, but IIUC these unavailable tservers are still registered tservers. Right now the autorebalancer only has descriptors of tservers that have registered with the TSManager. Is there a way to get information on un-registered tservers that have replicas? http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@501 PS11, Line 501: // Since only returning live descriptors, all tservers healthy. > I think this logic has a flaw: the descriptors available for placement are I've changed the behavior to match the default behavior of the rebalancer tool; if any of the tservers are unavailable for placement, AutorebalancerTask::BuildClusterRawInfo() will return an error, and the task will sleep. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@509 PS11, Line 509: ts->last_address().ToString(); > I mentioned it before and I'm re-iterating this again: tablet server addres Oops, must have missed taking this out the last round of review. If cluster isn't location-aware, no need to set summary.ts_location, since rebalancing/ksck code always checks to see if it's empty. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@517 PS11, Line 517: CatalogManager::ScopedLeaderSharedLock l(catalog_manager_); > Check the status here. Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@550 PS11, Line 550: l > nit: could you use a different variable name so this doesn't get conflated Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@572 PS11, Line 572: ConsensusConfigType::COMMITTED, : cstatepb.current_term(), : cstatepb.committed_config().opid_index(), : cstatepb.leader_uuid(), : voters, : non_voters); > nit: too many spaces Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@739 PS11, Line 739: sort(locations.begin(), locations.end()); > Why is this sorting important? Oops, this was a remnant of code that is responsible for printing the per-location balance information. Will remove. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@748 PS11, Line 748: > nit: extra space Done http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/rebalance/rebalancer.h File src/kudu/rebalance/rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/rebalance/rebalancer.h@160 PS11, Line 160: // Filter move operations in 'replica_moves': remove all operations that would : // involve moving replicas of tablets which are in 'scheduled_moves'. The : // 'replica_moves' cannot be null. : static void FilterMoves(const MovesInProgress& scheduled_moves, : std::vector<ReplicaMove>* replica_moves); > Did this need to be made public? It isn't used by auto_rebalancer or rebala It's used in RebalancerTool::BaseRunner::GetNextMoves(). I'll bump it down to protected, since RebalancerTool inherits from Rebalancer. -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 11 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Wed, 16 Oct 2019 04:28:57 +0000 Gerrit-HasComments: Yes
