Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................

Patch Set 17:

(22 comments)

Focused mostly on the non-test code.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h
File src/kudu/master/auto_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@117
PS17, Line 117: Status GetMoves(
nit: can you also note how this can fail?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@132
PS17, Line 132: Marks replicas that
nit: marks them with what?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@137
PS17, Line 137: Status ExecuteMoves(
nit: can you also note how this can fail?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@139
PS17, Line 139: const bool& has_location_violations
nit: booleans are trivially copyable, so we don't need to pass them by const ref.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@188
PS17, Line 188: wake_up_cv_
It wasn't clear to me that this is only used to signal that the rebalancer
thread is shutting down. Maybe rename it 'shutdown_cv_'? And update the
comment to reflect that?

http://gerrit.cloudera.org:8080/#/c/14177/14/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/14/src/kudu/master/auto_rebalancer.cc@660
PS14, Line 660: replicas_at_location.
> Actually just removed this param; can just check size of replica_moves per

This principle holds true everywhere: partial or incomplete mutation makes
things difficult to reason about.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@199
PS17, Line 199: // Check if shutdown was signaled.
             : {
             :   lock_guard<Mutex> l(lock_);
             :   if (closing_) return;
             : }
nit: maybe this should be

  if (wake_up_cv_.WaitFor(MonoDelta::FromSeconds(FLAGS_auto_rebal..))) {
    return;
  }

then we'd be able to get rid of that sleep at the bottom.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@207
PS17, Line 207: CatalogManager::ScopedLeaderSharedLock l(catalog_manager_);
             : if (!l.first_failed_status().ok()) {
             :   moves_scheduled_this_round_for_test_ = 0;
             :   continue;
             : }
Should we at least sleep here? Otherwise this is a tight loop for followers.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@261
PS17, Line 261: continue;
If we're moving on without successfully executing and calling
CheckReplicaMovesCompleted(), doesn't that mean 'replica_moves' might not be
empty upon each iteration? Should we be initializing a new 'replica_moves' on
each iteration?

To make matters worse, it seems like both 'replica_moves' and
'has_location_violations' are updated even if this fails, meaning we need to
make sure we're handling the failure scenario with care (or just don't update
them if this fails, as I note in-line with the test methods).

It's probably worth adding a test that rebalances a cluster with unstable
Raft configurations (e.g. using latency injection flags and low Raft
timeouts) to test the behavior when this fails. As it is, I'm pretty sure
we'd hit a DCHECK at the top of the loop.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@267
PS17, Line 267: wake_up_cv_.WaitFor(
             :     MonoDelta::FromSeconds(FLAGS_auto_rebalancing_wait_for_replica_moves_seconds));
             : if (closing_) {
             :   return;
             : }
This can be:

  if (wake_up_cv_.WaitFor(MonoDelta::FromSeconds(FLAGS_auto_reba..))) {
    return;
  }

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@320
PS17, Line 320: // If no placement policy violations are found, do load rebalancing.
             : if (!replica_moves->empty()) {
             :   return Status::OK();
             : }
nit: move this into the policy fixer block?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@326
PS17, Line 326: *has_location_violations = false;
nit: update a local variable and swap at the end?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@332
PS17, Line 332: /*cross_location=*/
nit: don't need this now that the enum is self-documenting.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@343
PS17, Line 343: replica_moves
nit: update a local variable and swap at the end? Also add a couple of spaces.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@388
PS17, Line 388: replica_moves
nit: update a local variable and swap at the end?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@450
PS17, Line 450: Mark the specified destination tserver,
nit: what mark are we making?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@501
PS17, Line 501: TSDescriptorVector descriptors;
             : ts_manager_->GetDescriptorsAvailableForPlacement(&descriptors);
             : if (descriptors.size() != ts_manager_->GetLiveCount()) {
             :   return Status::IllegalState(Substitute("not all tservers available for tablet placement"));
             : }
The TSManager only knows about tablet servers that have heartbeated recently,
whereas the catalog manager knows about all tablet servers that contain
tablets. I think we want to wait until we know that all of the tablet servers
that contain tablets are online. Otherwise, I'm pretty sure this condition is
almost always true (unless a tablet server has stopped heartbeating for some
reason): if we're supposed to have 10 tablet servers, we restart the master,
and 3 tablet servers heartbeat to the master, this condition will be true.
http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@526
PS17, Line 526: RETURN_NOT_OK(catalog_manager_->GetAllTables(&table_infos));
Do we need to RETURN_NOT_OK(leaderlock.first_failed_status())? Do we have any
tests that exercise rebalancing when there are leadership changes on the
master?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@545
PS17, Line 545: TabletMetadataLock l(tablet.get(), LockMode::READ);
It'd be much less error-prone and much easier to read if this were called
`tablet_l` and the lock at L532 were `table_l` or something, to highlight
that they're different locks.

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@570
PS17, Line 570: // Consensus state information is the same for all replicas of this tablet.
Is this true? Or is it a benign approximation? My guess is that it's an
approximation because it's easy to get what the master believes to be the
consensus state of the tablet, and that it's benign because rebalancing
doesn't actually consider the consensus state when deciding whether to make
moves. Are either of these true?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@700
PS17, Line 700: hange 'moves_complete' to false
No longer applies?

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@775
PS17, Line 775: *completion_status = Status::Incomplete(Substitute(
             :     "tablet $0, TS $1 -> TS $2 move failed, target replica disappeared",
             :     tablet_uuid, from_ts_uuid, to_ts_uuid));
This isn't used if we return Status::OK(). Why not just return the error?
--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 17
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 21 Feb 2020 20:40:31 +0000
Gerrit-HasComments: Yes
