Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 11: (11 comments) http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc@165 PS11, Line 165: // Each tserver is considered its own location. I'm not sure I understand what this means. As far as I know, in case of no locations are assigned, all tservers are considered to be in the same pseudo-location and that's not a location-aware cluster. In that sense, there is no need to satisfy the location placement policy or perform any inter-location balancing. Has that changed recently? As I can see, this test doesn't assign locations for tablet servers. If you want to start assigning locations to tablet servers, you can see an example how it's done for test scenarios like this one: https://github.com/apache/kudu/blob/03e2ada694290cafce0bea6ebbf092709aa64a2a/src/kudu/integration-tests/location_assignment-itest.cc#L77-L86 http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer-test.cc@180 PS11, Line 180: Each tserver is its own location ditto http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.h@165 PS11, Line 165: // Structs to hold cluster information. : // rebalance::ClusterRawInfo raw_info_; : // rebalance::ClusterInfo cluster_info_; : // rebalance::TabletsPlacementInfo placement_info_; Remove if no longer needed? http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@224 PS11, Line 224: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); Would this be a double-pause given there is another sleep at line 195? http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@228 PS11, Line 228: Rebalancer::MovesInProgress() Why the moves in progress are empty? I think this cannot be empty: the information should be kept between iterations. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@231 PS11, Line 231: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); ditto http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@236 PS11, Line 236: Rebalancer::MovesInProgress() Ditto: why is it empty? What if previous run has scheduled some moves? Maybe I'm missing something, but I don't see the state of the scheduled moves tracked somewhere. Please point there if I'm missing that. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@239 PS11, Line 239: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds) ditto http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@246 PS11, Line 246: IsClusterBalanced Why this method/function is necessary at all? Wouldn't GetMoves() return an empty set of mo ves if the cluster is balanced? http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@501 PS11, Line 501: // Since only returning live descriptors, all tservers healthy. I think this logic has a flaw: the descriptors available for placement are the tablet servers available to move replicas _to_, but there might be other alive tablet servers which contain too many replicas which might be moved _from_ those overloaded tablet servers. The idea behind original BuildClusterRawInfo() was about getting the whole picture. If not counting in tablet replicas at servers not available now for placement, the picture is distorted. That might result in first moving some replicas elsewhere, but then re-doing all that stuff once the tablet server which was not available for placement is back. I think there should be at least an option to avoid making any moves if any tablets are not up. At least, the safest default behavior is avoid moving replicas if it's not possible to get a full picture of replicas distribution in the cluster. Besides being safe, that default behavior would be compatible with the default behavior of the rebalancer tool. http://gerrit.cloudera.org:8080/#/c/14177/11/src/kudu/master/auto_rebalancer.cc@509 PS11, Line 509: ts->last_address().ToString(); I mentioned it before and I'm re-iterating this again: tablet server address is NOT a location and should not be treated like that. Otherwise, I'm not sure the rebalancing would behave as needed in case of non-location aware clusters. Please fix this. -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 11 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Sat, 05 Oct 2019 06:21:52 +0000 Gerrit-HasComments: Yes
