Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )

Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 20:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14177/20/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/20/src/kudu/master/auto_rebalancer-test.cc@457
PS20, Line 457: // Verify that movement of replicas to meet the replication factor
              : // does not count towards rebalancing, i.e. the auto-rebalancer will
              : // not consider recovering replicas as candidates for replica movement.
BTW, how do we know that the absence of rebalancing attempts is not simply due 
to the replica distribution being de facto even, so that the auto-rebalancer 
sees that and schedules no replica movements?

I think a more reliable scenario to justify this description (if I understand 
it correctly) would be to have only 3 tablet servers at the beginning, when all 
tablets are being created.  Then add a new tablet server and shut down one of 
the 3 original tablet servers.

Also, re-replication kicks in only after the 
--follower_unavailable_considered_failed_sec interval has passed since the 
tablet server became unavailable (default is 300).  So, if the idea was to 
catch a few replicas while their re-replication is in progress, it would be 
necessary to shorten that interval as well.
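
To illustrate the first concern, here is a toy model in plain C++.  It is not 
Kudu code: Skew() and WantsMoves() are hypothetical names, and treating a skew 
of at most 1 as "balanced" is an assumption made only for this sketch.

```cpp
#include <algorithm>
#include <vector>

// Toy model, NOT Kudu code. Each element is the number of replicas hosted
// by one live tablet server.
// Returns the difference between the most and the least loaded server.
int Skew(const std::vector<int>& replicas_per_tserver) {
  if (replicas_per_tserver.empty()) return 0;
  const auto mm = std::minmax_element(replicas_per_tserver.begin(),
                                      replicas_per_tserver.end());
  return *mm.second - *mm.first;
}

// Hypothetical predicate: would a purely skew-based rebalancer schedule
// any moves? Assumption: a skew of 0 or 1 counts as balanced.
bool WantsMoves(const std::vector<int>& replicas_per_tserver) {
  return Skew(replicas_per_tserver) > 1;
}
```

In this model, WantsMoves({6, 6, 6, 6}) is false, so a test that starts from an 
even distribution sees no moves regardless of how recovering replicas are 
treated.  WantsMoves({6, 6, 6, 0}), i.e. 3 original tablet servers plus a newly 
added empty one, is true, so the absence of moves there would actually say 
something about the auto-rebalancer.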


http://gerrit.cloudera.org:8080/#/c/14177/20/src/kudu/master/auto_rebalancer-test.cc@462
PS20, Line 462: FLAGS_tserver_unresponsive_timeout_ms
It's a separate topic (i.e. not a topic for this particular scenario), but do 
you know what would happen in a scenario like this if the timeout were left as 
is (i.e. 60 * 1000 ms)?

Basically, what if the failure of a tablet server is not yet accounted for by 
the TSManager and the auto-rebalancer tries to schedule a replica movement 
to/from the downed tablet server?  The idea is to make sure that the failed 
replica movement attempt is handled as expected by the auto-rebalancer.  Do you 
think it's possible to add a scenario for this?
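
The window in question can be pictured with a toy helper.  Again, this is not 
Kudu code: PresumedLive() is a hypothetical name, and the strict comparison 
against the timeout is an assumption of the sketch, not a statement about how 
TSManager is implemented.

```cpp
#include <cstdint>

// Toy model, NOT Kudu code. Hypothetical helper: a master that declares a
// tablet server unresponsive only after timeout_ms without a heartbeat
// still presumes the server live until that much time has elapsed.
bool PresumedLive(int64_t ms_since_last_heartbeat, int64_t timeout_ms) {
  return ms_since_last_heartbeat < timeout_ms;
}
```

With the timeout left at 60 * 1000 ms, PresumedLive(5000, 60 * 1000) is true in 
this model: a tablet server that went down 5 seconds ago could still be picked 
as a source or destination of a replica movement.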



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 20
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 11 Mar 2020 20:48:29 +0000
Gerrit-HasComments: Yes
