[kudu-CR] KUDU-2914: Rebalance tool support moving replicas from some specific tablet servers

Andrew Wong (Code Review) Fri, 18 Oct 2019 19:38:33 -0700

Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14154 )


Change subject: KUDU-2914: Rebalance tool support moving replicas from some 
specific tablet servers
......................................................................


Patch Set 6:

(13 comments)

I did a high-level look through the code, though I still need to understand the 
rebalancer's runner hierarchy more.

http://gerrit.cloudera.org:8080/#/c/14154/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14154/6//COMMIT_MSG@20
PS6, Line 20: unhealthy
            : tservers in the given ignored tservers will also be ignored by
            : the rebalancer tool
Why bother differentiating between unhealthy and healthy ignored servers? Is it 
because we expect the re-replication to happen for those tservers 
automatically? Could you document the reasoning somewhere in a comment?


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance-test.cc
File src/kudu/rebalance/rebalance-test.cc:

http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance-test.cc@157
PS6, Line 157:     auto getValues = [](pair<ServersByCountMap::const_iterator,
nit: for local variables, we prefer snake_case, or sometimes for lambdas, we 
the same PascalCase we use for functions.


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance_algo.h
File src/kudu/rebalance/rebalance_algo.h:

http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/rebalance/rebalance_algo.h@93
PS6, Line 93:  helathy_ignored_tservers;
nit: "healthy"

Alternatively, rather than having the concept of "healthy" and "ignored", which 
are somewhat concepts of the CLI tool rather than the rebalancing algorithm, 
how about calling this "replica_counts_to_empty" that refer to the counts for 
tablet servers that the rebalancing should leave empty.


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool-test.cc
File src/kudu/tools/rebalancer_tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool-test.cc@241
PS6, Line 241: ::testing::WithParamInterface<Kudu1097>
Why bother testing the 3-2-3 replication? I don't think maintenance mode will 
do much in this mode.


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.h
File src/kudu/tools/rebalancer_tool.h:

http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.h@322
PS6, Line 322: IgnoredTserversRunner
nit: Could you add a class-wide comment explaining what the 
IgnoredTserversRunner does that the PolicyFixer doesn't? What is the final 
"balanced" state for each?


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc
File src/kudu/tools/rebalancer_tool.cc:

http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@639
PS6, Line 639: "Too many ignored tservers.\nThe number of remaining tservers 
must be at least $0."
nit: for returned statuses, we generally just use a single line message. Also 
maybe list the number of healthy servers? E.g.

Substitute("Too many ignored servers; $0 healthy servers exist but $1 are 
required", remaining_tservers_count, max_replication_factor)


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1217
PS6, Line 1217:   DCHECK(s.ok());
nit: this is trivially true based on the line above, so maybe remove it?


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1302
PS6, Line 1302:   // Use pessimistic /2 limit for max_moves_per_servers_ since 
the
              :   // desitnation servers for the move of the replica marked with
              :   // the REPLACE attribute is not known.
nit: while I appreciate you taking care of this TODO, could you move this 
change in behavior into a separate patch with targeted testing? If possible, 
it'd be nice to make sure each patch is focused.


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1337
PS6, Line 1337:   // update helper containers.
              :   const auto op_count = op_count_per_ts_[ts_uuid]++;
              :   const auto op_range = ts_per_op_count_.equal_range(op_count);
              :   bool ts_op_count_updated = false;
              :   for (auto it = op_range.first; it != op_range.second; ++it) {
              :     if (it->second == ts_uuid) {
              :       ts_per_op_count_.erase(it);
              :       ts_per_op_count_.emplace(op_count + 1, ts_uuid);
              :       ts_op_count_updated = true;
              :       break;
              :     }
              :   }
Do you know why the policy fixer didn't do this before?


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1421
PS6, Line 1421: ignoted
nit: ignored


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1482
PS6, Line 1482: tablet_id
nit: tserver ID


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/rebalancer_tool.cc@1506
PS6, Line 1506:  i<max_moves_per_server_
nit: fix spacing


http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/tool_action_cluster.cc
File src/kudu/tools/tool_action_cluster.cc:

http://gerrit.cloudera.org:8080/#/c/14154/6/src/kudu/tools/tool_action_cluster.cc@131
PS6, Line 131:  If set true, "
             :             "the healthy ignored servers should be set into 
maintenance mode first.
While this is the most effective way to decommission a tserver, I don't think 
it should be a hard requirement of this tool. For instance, if in the future, 
we add a "decommissioning" state, I think we would want this flag to work as 
well.



--
To view, visit http://gerrit.cloudera.org:8080/14154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I86cfb740030946c13db1a9ca63d241f4907d6c89
Gerrit-Change-Number: 14154
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Comment-Date: Sat, 19 Oct 2019 02:37:19 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2914: Rebalance tool support moving replicas from some specific tablet servers

Reply via email to