Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14154 )
Change subject: KUDU-2914: Rebalance tool support moving replicas from some specific tablet servers ...................................................................... Patch Set 6: (13 comments) I did a high-level look through the code, though I still need to understand the rebalancer's runner hierarchy more. http://gerrit.cloudera.org:8080/#/c/14154/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14154/6//COMMIT_MSG@20 PS6, Line 20: unhealthy : tservers in the given ignored tservers will also be ignored by : the rebalancer tool Why bother differentiating between unhealthy and healthy ignored servers? Is it because we expect the re-replication to happen for those tservers automatically? Could you document the reasoning somewhere in a comment? http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance-test.cc File src/kudu/rebalance/rebalance-test.cc: http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance-test.cc@157 PS6, Line 157: auto getValues = [](pair<ServersByCountMap::const_iterator, nit: for local variables, we prefer snake_case, or sometimes for lambdas, we the same PascalCase we use for functions. http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance_algo.h File src/kudu/rebalance/rebalance_algo.h: http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance_algo.h@93 PS6, Line 93: helathy_ignored_tservers; nit: "healthy" Alternatively, rather than having the concept of "healthy" and "ignored", which are somewhat concepts of the CLI tool rather than the rebalancing algorithm, how about calling this "replica_counts_to_empty" that refer to the counts for tablet servers that the rebalancing should leave empty. http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool-test.cc File src/kudu/tools/rebalancer_tool-test.cc: http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool-test.cc@241 PS6, Line 241: ::testing::WithParamInterface<Kudu1097> Why bother testing the 3-2-3 replication? I don't think maintenance mode will do much in this mode. http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.h File src/kudu/tools/rebalancer_tool.h: http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.h@322 PS6, Line 322: IgnoredTserversRunner nit: Could you add a class-wide comment explaining what the IgnoredTserversRunner does that the PolicyFixer doesn't? What is the final "balanced" state for each? http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc File src/kudu/tools/rebalancer_tool.cc: http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@639 PS6, Line 639: "Too many ignored tservers.\nThe number of remaining tservers must be at least $0." nit: for returned statuses, we generally just use a single line message. Also maybe list the number of healthy servers? E.g. Substitute("Too many ignored servers; $0 healthy servers exist but $1 are required", remaining_tservers_count, max_replication_factor) http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1217 PS6, Line 1217: DCHECK(s.ok()); nit: this is trivially true based on the line above, so maybe remove it? http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1302 PS6, Line 1302: // Use pessimistic /2 limit for max_moves_per_servers_ since the : // desitnation servers for the move of the replica marked with : // the REPLACE attribute is not known. nit: while I appreciate you taking care of this TODO, could you move this change in behavior into a separate patch with targeted testing? If possible, it'd be nice to make sure each patch is focused. http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1337 PS6, Line 1337: // update helper containers. : const auto op_count = op_count_per_ts_[ts_uuid]++; : const auto op_range = ts_per_op_count_.equal_range(op_count); : bool ts_op_count_updated = false; : for (auto it = op_range.first; it != op_range.second; ++it) { : if (it->second == ts_uuid) { : ts_per_op_count_.erase(it); : ts_per_op_count_.emplace(op_count + 1, ts_uuid); : ts_op_count_updated = true; : break; : } : } Do you know why the policy fixer didn't do this before? http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1421 PS6, Line 1421: ignoted nit: ignored http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1482 PS6, Line 1482: tablet_id nit: tserver ID http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1506 PS6, Line 1506: i<max_moves_per_server_ nit: fix spacing http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/tool_action_cluster.cc File src/kudu/tools/tool_action_cluster.cc: http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/tool_action_cluster.cc@131 PS6, Line 131: If set true, " : "the healthy ignored servers should be set into maintenance mode first. While this is the most effective way to decommission a tserver, I don't think it should be a hard requirement of this tool. For instance, if in the future, we add a "decommissioning" state, I think we would want this flag to work as well. -- To view, visit http://gerrit.cloudera.org:8080/14154 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I86cfb740030946c13db1a9ca63d241f4907d6c89 Gerrit-Change-Number: 14154 Gerrit-PatchSet: 6 Gerrit-Owner: Yifan Zhang <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Comment-Date: Sat, 19 Oct 2019 02:37:19 +0000 Gerrit-HasComments: Yes
