Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14154 )

Change subject: KUDU-2914: Rebalance tool support moving replicas from some 
specific tablet servers
......................................................................


Patch Set 12:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc
File src/kudu/rebalance/rebalance-test.cc:

http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@156
PS12, Line 156:     auto get_values = [](pair<ServersByCountMap::const_iterator,
const ?


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@157
PS12, Line 157:
nit: fix the alignment


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@197
PS12, Line 197: bool HasSameContents(const multimap<int32_t, TableBalanceInfo>& 
lhs,
              :                      const multimap<int32_t, TableBalanceInfo>& 
rhs) {
nit: is it possible to templatize bool HasSameContents(const 
ServersByCountMap&, const ServersByCountMap&) and this function so there would 
be less code duplication?


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@271
PS12, Line 271: ContainsIgnoredTserver
I think this method by itself doesn't imply working on ignored or other subset 
of tablet servers.  Maybe, renaming it to something like

  ContainsOneOf(...)

Also, move this method into the private section of this class since its 
signature and logic hints that it's very specific to this class and unlikely to 
be reused somewhere else.


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@293
PS12, Line 293:       
ASSERT_FALSE(ContainsIgnoredTserver(ci.balance.servers_by_total_replica_count,
I think it's worth adding a comment explaining the essence of the constrains 
being checked below.   Yes, it's obvious what's going on, but it would be nice 
to explain why.


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@296
PS12, Line 296:         
ASSERT_FALSE(ContainsIgnoredTserver(elem.second.servers_by_replica_count,
Add SCOPED_TRACE() to get more information on the context if ASSERT_FALSE() 
triggers.  Having TableBalanceInfo::tablet_id from 'elem' would be enough, I 
think.


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance-test.cc@527
PS12, Line 527: when movement of
if moving replicas from


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance_algo.h
File src/kudu/rebalance/rebalance_algo.h:

http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalance_algo.h@92
PS12, Line 92: of
nit: on


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalancer.cc
File src/kudu/rebalance/rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/rebalance/rebalancer.cc@450
PS12, Line 450:       auto it = tserver_replicas_count.find(ignored_tserver);
              :       if (it == tserver_replicas_count.end()) {
Would FindOrNull() work here?


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc
File src/kudu/tools/rebalancer_tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@273
PS12, Line 273:       "--ignored_tservers=" +
              :       JoinStrings(ignored_tservers, ","),
Join these two strings.


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@701
PS12, Line 701: If detecting a ignored tserver went down
If an ignored tablet server went down


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@702
PS12, Line 702: servers
server


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@706
PS12, Line 706:  :
nit: move this to the next line


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@757
PS12, Line 757: IsRebalancingInProgress()
I think it might be important to know which tablet server is in progress of 
copying data (that's the why how this method works).

What if it's some other tablet server (not the ignored one) is in progress of 
copying data?  Or that's the server which is ignored?

In other words, do you know whether this test covers both cases below:

  * the ignored tablet server goes down during replica movement from it
  * the ignored tablet server goes down before rebalancer tries to move a 
replica from it

It would be nice to clarify on that.


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool-test.cc@795
PS12, Line 795:   
ASSERT_OK(cluster_->tablet_server(ignored_tserver_idx)->Restart());
Is there some significance behind starting the tablet server back?  If yes, 
please a comment explaining why.  If not, maybe drop this line.


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool.cc
File src/kudu/tools/rebalancer_tool.cc:

http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool.cc@639
PS12, Line 639: healthy
healthy non-ignored


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/rebalancer_tool.cc@1243
PS12, Line 1243:     has_updates |= s.ok()
I'm not sure this is correct by the semantics of the 'has_updates' variable: if 
the method returned Status::OK(), that doesn't mean there has been an update.  
For update to be there, it's necessary to have 'is_complete' to be 'true'.

What is the motivation for this change?


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc
File src/kudu/tools/tool_action_cluster.cc:

http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc@131
PS12, Line 131: the tablet server
the source tablet server


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc@132
PS12, Line 132:             "This setting is effective only if the 
'--ignored_tservers' flag"
nit: add a space between 'flag' and '"'


http://gerrit.cloudera.org:8080/#/c/14154/12/src/kudu/tools/tool_action_cluster.cc@134
PS12, Line 134: replicas shouldn't be placed on all healthy ignored servers "
              :             "automatically, e.g. set them into 'mainetance 
mode' or 'decommision mode'.
then all ignored tablet servers must be placed into the 'maintenance mode'



--
To view, visit http://gerrit.cloudera.org:8080/14154
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I86cfb740030946c13db1a9ca63d241f4907d6c89
Gerrit-Change-Number: 14154
Gerrit-PatchSet: 12
Gerrit-Owner: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Comment-Date: Thu, 14 Nov 2019 02:19:02 +0000
Gerrit-HasComments: Yes

Reply via email to