Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14154 )
Change subject: KUDU-2914: Rebalance tool support moving replicas from some specific tablet servers ...................................................................... Patch Set 12: (20 comments) http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc File src/kudu/rebalance/rebalance-test.cc: http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@156 PS12, Line 156: auto get_values = [](pair<ServersByCountMap::const_iterator, const ? http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@157 PS12, Line 157: nit: fix the alignment http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@197 PS12, Line 197: bool HasSameContents(const multimap<int32_t, TableBalanceInfo>& lhs, : const multimap<int32_t, TableBalanceInfo>& rhs) { nit: is it possible to templatize bool HasSameContents(const ServersByCountMap&, const ServersByCountMap&) and this function so there would be less code duplication? http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@271 PS12, Line 271: ContainsIgnoredTserver I think this method by itself doesn't imply working on ignored or other subset of tablet servers. Maybe, renaming it to something like ContainsOneOf(...) Also, move this method into the private section of this class since its signature and logic hints that it's very specific to this class and unlikely to be reused somewhere else. http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@293 PS12, Line 293: ASSERT_FALSE(ContainsIgnoredTserver(ci.balance.servers_by_total_replica_count, I think it's worth adding a comment explaining the essence of the constrains being checked below. Yes, it's obvious what's going on, but it would be nice to explain why. http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@296 PS12, Line 296: ASSERT_FALSE(ContainsIgnoredTserver(elem.second.servers_by_replica_count, Add SCOPED_TRACE() to get more information on the context if ASSERT_FALSE() triggers. Having TableBalanceInfo::tablet_id from 'elem' would be enough, I think. http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@527 PS12, Line 527: when movement of if moving replicas from http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance_algo.h File src/kudu/rebalance/rebalance_algo.h: http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance_algo.h@92 PS12, Line 92: of nit: on http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalancer.cc File src/kudu/rebalance/rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalancer.cc@450 PS12, Line 450: auto it = tserver_replicas_count.find(ignored_tserver); : if (it == tserver_replicas_count.end()) { Would FindOrNull() work here? http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc File src/kudu/tools/rebalancer_tool-test.cc: http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@273 PS12, Line 273: "--ignored_tservers=" + : JoinStrings(ignored_tservers, ","), Join these two strings. http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@701 PS12, Line 701: If detecting a ignored tserver went down If an ignored tablet server went down http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@702 PS12, Line 702: servers server http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@706 PS12, Line 706: : nit: move this to the next line http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@757 PS12, Line 757: IsRebalancingInProgress() I think it might be important to know which tablet server is in progress of copying data (that's the why how this method works). What if it's some other tablet server (not the ignored one) is in progress of copying data? Or that's the server which is ignored? In other words, do you know whether this test covers both cases below: * the ignored tablet server goes down during replica movement from it * the ignored tablet server goes down before rebalancer tries to move a replica from it It would be nice to clarify on that. http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@795 PS12, Line 795: ASSERT_OK(cluster_->tablet_server(ignored_tserver_idx)->Restart()); Is there some significance behind starting the tablet server back? If yes, please a comment explaining why. If not, maybe drop this line. http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool.cc File src/kudu/tools/rebalancer_tool.cc: http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool.cc@639 PS12, Line 639: healthy healthy non-ignored http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool.cc@1243 PS12, Line 1243: has_updates |= s.ok() I'm not sure this is correct by the semantics of the 'has_updates' variable: if the method returned Status::OK(), that doesn't mean there has been an update. For update to be there, it's necessary to have 'is_complete' to be 'true'. What is the motivation for this change? http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc File src/kudu/tools/tool_action_cluster.cc: http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc@131 PS12, Line 131: the tablet server the source tablet server http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc@132 PS12, Line 132: "This setting is effective only if the '--ignored_tservers' flag" nit: add a space between 'flag' and '"' http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc@134 PS12, Line 134: replicas shouldn't be placed on all healthy ignored servers " : "automatically, e.g. set them into 'mainetance mode' or 'decommision mode'. then all ignored tablet servers must be placed into the 'maintenance mode' -- To view, visit http://gerrit.cloudera.org:8080/14154 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I86cfb740030946c13db1a9ca63d241f4907d6c89 Gerrit-Change-Number: 14154 Gerrit-PatchSet: 12 Gerrit-Owner: Yifan Zhang <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Comment-Date: Thu, 14 Nov 2019 02:19:02 +0000 Gerrit-HasComments: Yes
