[ 
https://issues.apache.org/jira/browse/KUDU-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jiacheng updated KUDU-3487:
--------------------------------
    Description: 
The CheckCompleteReplace function in replace rebalancing tries to make the 
leader step down if the replica that should be removed is the leader. This can 
get stuck for a while if the replication factor of the table is 1, since there 
is no other voter to transfer leadership to.

So the fix is to make sure the number of voters of the tablet is greater than 1 
before sending the LeaderStepDown request.
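
As a rough illustration of the proposed check (not the attached patch), here is 
a minimal self-contained sketch. The Replica struct, the MemberType enum and 
the helper names CountVoters and ShouldRequestLeaderStepDown are hypothetical 
stand-ins, not Kudu's actual types or functions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for a tablet's Raft config members.
enum class MemberType { VOTER, NON_VOTER };

struct Replica {
  std::string ts_uuid;  // UUID of the tablet server hosting the replica
  MemberType type;      // voter or non-voter
  bool is_leader;       // whether this replica is currently the leader
};

// Counts the voter replicas in the tablet's config.
inline int CountVoters(const std::vector<Replica>& replicas) {
  int voters = 0;
  for (const auto& r : replicas) {
    if (r.type == MemberType::VOTER) {
      ++voters;
    }
  }
  return voters;
}

// Returns true only if the replica to be removed is the leader AND there is
// more than one voter, i.e. some other voter can take over leadership.
inline bool ShouldRequestLeaderStepDown(const std::vector<Replica>& replicas,
                                        const std::string& uuid_to_remove) {
  bool target_is_leader = false;
  for (const auto& r : replicas) {
    if (r.ts_uuid == uuid_to_remove && r.is_leader) {
      target_is_leader = true;
      break;
    }
  }
  return target_is_leader && CountVoters(replicas) > 1;
}
```

With a single voter the helper returns false, so no step-down request would be 
sent until a second voter has joined the config.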

Here's an example:

I executed the following commands to move all the tablets off a tablet server.

kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
kudu cluster rebalance master1,master2,master3  -ignored_tservers 
f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers .

The rebalancer then gets stuck at a certain tablet for a while. 

In this case it was stuck for more than 10 minutes.

!image-2023-07-25-15-04-37-930.png!

The reason is that the tablet does the leader step down too early and stays in 
the leader_transfer_in_progress_ state. The master then tries to send a change 
config request to add a peer, but it gets refused by the tablet server because 
of the leader_transfer_in_progress_ state.
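
The sequence can be modeled with a toy state machine. This is only a sketch of 
the behavior described here: ToyTabletReplica, ChangeConfigResult and their 
members are hypothetical names, and the real tablet server logic is more 
involved.

```cpp
#include <cassert>

// Hypothetical result codes for a change config request in this toy model.
enum class ChangeConfigResult { kOk, kRefusedLeaderTransferInProgress };

// Toy model of a single-replica tablet's leader.
struct ToyTabletReplica {
  int num_peers = 1;                         // replication factor 1
  bool leader_transfer_in_progress = false;  // mirrors the flag named above

  // A step-down request starts a leadership transfer. In this model nothing
  // clears the flag, because there is no other voter to take over.
  void BeginLeaderStepDown() { leader_transfer_in_progress = true; }

  // Adding a peer is refused while a leadership transfer is in progress.
  ChangeConfigResult AddPeer() {
    if (leader_transfer_in_progress) {
      return ChangeConfigResult::kRefusedLeaderTransferInProgress;
    }
    ++num_peers;
    return ChangeConfigResult::kOk;
  }
};
```

Calling BeginLeaderStepDown() before AddPeer() leaves the model with one peer 
and a refused request, while calling AddPeer() first succeeds.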

!image-2023-07-25-15-11-16-505.png!

!image-2023-07-25-15-11-55-381.png!

  was:
The CheckCompleteReplace function in replace rebalancing tries to make the 
leader step down if the replica that should be removed is the leader. This can 
get stuck for a while if the replication factor of the table is 1, since there 
is no other voter to transfer leadership to.

So the fix is to make sure the number of voters of the tablet is greater than 1 
before sending the LeaderStepDown request.

Here's an example:

I executed the following commands to move all the tablets off a tablet server.

kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
kudu cluster rebalance master1,master2,master3  -ignored_tservers 
f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers .

The rebalancer then gets stuck at a certain tablet for a while. 

In this case it was stuck for more than 10 minutes.

!image-2023-07-25-15-04-37-930.png!

The reason is that the tablet does the leader step down too early and stays in 
the leader_transfer_in_progress_ state. The master then tries to send a change 
config request to add a peer, but it gets refused by the tablet server because 
of the leader_transfer_in_progress_ state.

 


> Rebalancer: Balance for 1 replication factor tablet might stuck for leader 
> step down too early
> ----------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3487
>                 URL: https://issues.apache.org/jira/browse/KUDU-3487
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.14.0
>            Reporter: Song Jiacheng
>            Priority: Major
>         Attachments: 
> Fix_a_bug_that_replace_balance_for_1_replication_factor_tablet_might_stuck_for_leader_step.patch,
>  image-2023-07-25-15-04-37-930.png, image-2023-07-25-15-11-16-505.png, 
> image-2023-07-25-15-11-55-381.png
>
>
> The CheckCompleteReplace function in replace rebalancing tries to make the 
> leader step down if the replica that should be removed is the leader. This 
> can get stuck for a while if the replication factor of the table is 1, since 
> there is no other voter to transfer leadership to.
> So the fix is to make sure the number of voters of the tablet is greater 
> than 1 before sending the LeaderStepDown request.
> Here's an example:
> I executed the following commands to move all the tablets off a tablet 
> server.
> kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
> kudu cluster rebalance master1,master2,master3  -ignored_tservers 
> f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers .
> The rebalancer then gets stuck at a certain tablet for a while. 
> In this case it was stuck for more than 10 minutes.
> !image-2023-07-25-15-04-37-930.png!
> The reason is that the tablet does the leader step down too early and stays 
> in the leader_transfer_in_progress_ state. The master then tries to send a 
> change config request to add a peer, but it gets refused by the tablet 
> server because of the leader_transfer_in_progress_ state.
> !image-2023-07-25-15-11-16-505.png!
> !image-2023-07-25-15-11-55-381.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
