Yuqi Du has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19608 )

Change subject: [master] Exclude tservers in MAINTENANCE_MODE when leader 
rebalancing
......................................................................


Patch Set 10:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/19608/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19608/9//COMMIT_MSG@14
PS9, Line 14: For this reason, we should exclude
            : such tservers.
> Having this patch is great, but it seems there is a race condition in this
Yes.
I want to use the tserver's 'quiesce state' to handle this in the next patch, 
but that state is a little different from 'maintenance mode'.

Maintenance mode is a state kept by kudu-master; a tserver doesn't know that it 
is in maintenance mode. So we need a way to extend that state to the tserver.

I will study the tserver's quiesce state again and then provide a solution to 
this problem.


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc
File src/kudu/master/auto_leader_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@234
PS9, Line 234:
             :   constexpr const int currentTserverIndex = 0;
             :   tserver::MiniTabletServer* mini_tserver = cluster_->mini_tablet_server(currentTserverIndex);
             :   // Sets the tserver state for a tserver to 'MAINTENANCE_MODE'.
             :   ASSERT_OK(
             :       master->ts_manager()->SetTServerState(mini_tserver->uuid(),
             :                                             TServerStatePB::MAINTENANCE_MODE,
             :                                             ChangeTServerStateRequestPB::ALLOW_MISSING_TSERVER,
             :                                             master->catalog_mana
> +1
Done


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@251
PS9, Line 251: // it's enough to reach
> Wrap this into ASSERT_OK()?
Done


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@253
PS9, Line 253:   constexpr const int32_t retries = 20;
> nit: please add a comment on the purpose of this pause
// To make sure the replica rebalancer executes some runs and reaches a 
// balanced state.
However, this test case has only 3 tservers, so the replicas are already 
balanced. The SleepFor is therefore not necessary, and I have removed it.


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@254
PS9, Line 254: {
> Why 20?  Why not 10 or 100?
It's an estimated value.

  // Try up to 20 rounds of leader rebalancing. If mini_tserver is not in
  // MAINTENANCE_MODE, that is enough to reach leader balance: more tries are
  // not necessary, and fewer tries may not reach leader balance.
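
The bounded retry described in that comment can be sketched as below. This is a minimal illustration and not the test code from the patch: RunUntilBalanced, max_tries, and the run_once callback are hypothetical names, and the callback stands in for one round of leader rebalancing that reports whether leaders are balanced.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper, not part of Kudu: calls 'run_once' up to 'max_tries'
// times and stops early as soon as it reports that leaders are balanced.
// Returns the number of rounds executed, or -1 if balance was never reached.
int RunUntilBalanced(const std::function<bool()>& run_once, int max_tries) {
  assert(max_tries >= 0);
  for (int i = 1; i <= max_tries; ++i) {
    if (run_once()) {
      return i;
    }
  }
  return -1;
}
```

With a cap like 20, a test can assert that balance is reached within the cap in the normal case, and that it is not reached when a tserver is excluded.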


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer-test.cc@264
PS9, Line 264: tatus.IsIllegalState()) <
> Does it make sense to check for exact Status code?  And error message patte
Done


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.h
File src/kudu/master/auto_leader_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.h@79
PS9, Line 79: std::unordered_set<st
> Is it crucial to have the set of UUIDs to be ordered?  If not, consider usi
Done
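
As a small illustration of why ordering is not needed, assuming the set of UUIDs is only used for membership tests: the sketch below filters candidate tservers against an excluded set. FilterOutExcluded and its parameters are hypothetical names, not the code in auto_leader_rebalancer.h.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not the patch's code: returns the tserver UUIDs from
// 'candidates' that are not in 'excluded', preserving the input order.
std::vector<std::string> FilterOutExcluded(
    const std::vector<std::string>& candidates,
    const std::unordered_set<std::string>& excluded) {
  std::vector<std::string> result;
  for (const auto& uuid : candidates) {
    // Only membership is tested; the set's iteration order never matters.
    if (excluded.count(uuid) == 0) {
      result.push_back(uuid);
    }
  }
  return result;
}
```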


http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.cc
File src/kudu/master/auto_leader_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/19608/9/src/kudu/master/auto_leader_rebalancer.cc@405
PS9, Line 405:     CatalogManager::ScopedLeaderSharedLock leader_lock(catalog_manager_);
> What if one more tablet server is put into the maintenance mode just betwee
Thanks for your advice. The problem does indeed exist, unless we hold a big 
lock around ts_manager's SetTServerState while the leader rebalancer runs, and 
a big lock is not appropriate. So I strengthened the check condition: in the 
extreme case where a new tserver enters MAINTENANCE_MODE during a run, at most 
1 tablet will be affected.
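
One way to picture a stronger check without a big lock is to query the tserver state again right before each leader transfer. The sketch below is an assumption about the general idea and not the code in the patch: LeaderTransfer, ApplyTransfers, and both callbacks are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical type for illustration; not the Kudu API.
struct LeaderTransfer {
  std::string tablet_id;
  std::string dest_ts_uuid;
};

// Applies planned transfers one at a time. 'in_maintenance' is queried again
// right before each transfer, so a tserver that entered maintenance mode after
// planning is skipped for all remaining tablets. Only a transfer that already
// passed its check when the state changed can still land on that tserver.
std::vector<LeaderTransfer> ApplyTransfers(
    const std::vector<LeaderTransfer>& planned,
    const std::function<bool(const std::string&)>& in_maintenance,
    const std::function<void(const LeaderTransfer&)>& do_transfer) {
  assert(in_maintenance && do_transfer);
  std::vector<LeaderTransfer> applied;
  for (const auto& t : planned) {
    if (in_maintenance(t.dest_ts_uuid)) {
      continue;  // State changed since planning: skip this destination.
    }
    do_transfer(t);
    applied.push_back(t);
  }
  return applied;
}
```

The window between the check and the transfer is not closed, only narrowed to a single tablet, which matches the trade-off against taking a big lock.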


After this patch, I plan a follow-up patch for the tserver quiesce state, which 
is often used together with maintenance mode during decommissioning: if a 
tserver in the quiesce state receives a request to transfer leadership to it, 
it should reject the request.



--
To view, visit http://gerrit.cloudera.org:8080/19608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2f85a675e69fd02a62e2625881dad2ca5e27acd9
Gerrit-Change-Number: 19608
Gerrit-PatchSet: 10
Gerrit-Owner: Yuqi Du <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: KeDeng <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Yuqi Du <[email protected]>
Gerrit-Comment-Date: Thu, 30 Mar 2023 08:28:53 +0000
Gerrit-HasComments: Yes
