Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: WIP KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 2: (12 comments) It seems many tests are broken now, even those which are not related to the rebalancer itself. It would be nice to understand what's wrong and fix those. http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.h@66 PS2, Line 66: explicit nit: I don't think explicit is needed here since it's hard to write something like AutoRebalancerTask s = { a, b } unintentionally. In other words, explicit usually makes sense for single-parameter constructors http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@156 PS2, Line 156: AutoRebalancerTask::RunLoop() I think if you want to adapt the current implementation to this worker model, it's worth taking a look at RebalancerTool::Run() and RebalancerTool::RunWith(). I think one iteration of the while() cycle in RebalancerTool::RunWith() might be one of those 'atomic' steps which we want to schedule by this thread, checking whether this master is still leader before scheduling next moves. http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@165 PS2, Line 165: if (!catalog_manager_) continue; How could it happen that catalog_manager_ is nullptr? And if so, wouldn't it be a programming mistake? And can it ever change during the lifetime of AutoRebalancerTask? http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@168 PS2, Line 168: CatalogManager::ScopedLeaderSharedLock l(catalog_manager_); As it's written now, the scope of this lock is too wide -- it will take a whole iteration to pass before releasing this lock. Meanwhile, holding this lock doesn't mean the actual master leader cannot change. So, I would simply limit the scope of this lock for a quick leader check before proceeding with next rebalancing operation. http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@171 PS2, Line 171: // TODO(hannah.nguyen): holding onto this lock too long? : // Concerning especially if replica moves aren't async. Since this lock is used only to synchronize access to closing_ by this method and Shutdown(), you could limit it just to the scope of checking for closing_. The ThreadJoiner will do its job awaiting for the thread to complete the iteration and exit. Another way around is to use CountDownLatch, and then call latch.WaitFor(<MonoDelta interval>) instead of SleepFor(). Shutdown() then will need to call latch.Countdown() to signal that it's time to shut down the activity. http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@183 PS2, Line 183: (!BuildClusterRawInfo(boost::none, &raw_info).ok()) Does it deserve some logging if it was not possible to get a snapshot of the cluster's status? http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@190 PS2, Line 190: \ drop http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@210 PS2, Line 210: nit: this extra line isn't good for readability http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@214 PS2, Line 214: / TODO(hannah.nguyen): schedule moves for placement policy violations How to check for the result status of that? It's necessary to make sure all violations are fixed before moving the the next phase of the rebalancing if PolicyFixer is enabled. http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@235 PS2, Line 235: // Step 4: Execute moves. : // TODO(hannah.nguyen): do this I think many moves are to be done in the background, so this thread should simply be some sort of a dispatcher for scheduling next step and processing the result when they are ready. http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@239 PS2, Line 239: FLAGS_auto_rebalancing_restart_flag = false; I'm not sure I understand the logic behind changes in FLAG_auto_rebalancing_restart_flag. Who is about to set it to 'true' again? http://gerrit.cloudera.org:8080/#/c/14177/2/src/kudu/master/auto_rebalancer.cc@246 PS2, Line 246: nit here and 3 closing parenthesis above: an extra empty line -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 2 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 05 Sep 2019 18:50:27 +0000 Gerrit-HasComments: Yes
