[ https://issues.apache.org/jira/browse/KUDU-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Song Jiacheng updated KUDU-3487:
--------------------------------
Attachment: image-2023-07-25-15-11-55-381.png
> Rebalancer: Balance for 1 replication factor tablet might stuck for leader
> step down too early
> ----------------------------------------------------------------------------------------------
>
> Key: KUDU-3487
> URL: https://issues.apache.org/jira/browse/KUDU-3487
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.14.0
> Reporter: Song Jiacheng
> Priority: Major
> Attachments:
> Fix_a_bug_that_replace_balance_for_1_replication_factor_tablet_might_stuck_for_leader_step.patch,
> image-2023-07-25-15-04-37-930.png, image-2023-07-25-15-11-16-505.png,
> image-2023-07-25-15-11-55-381.png
>
>
> The function CheckCompleteReplace in the replace-based rebalancing tries to
> make the leader step down if the replica that should be removed is the
> leader. This may get stuck for a while if the replication factor of the
> table is 1, since there is no other voter to transfer leadership to.
> So it should be fine if we make sure the number of voters of the tablet is
> greater than 1 before sending the LeaderStepDown request.
> Here's an example:
> I executed the following commands to move all the tablets off a tablet
> server:
> kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
> kudu cluster rebalance master1,master2,master3 -ignored_tservers f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers
> The rebalancer then got stuck on a certain tablet; it stayed stuck for more
> than 10 minutes.
> !image-2023-07-25-15-04-37-930.png!
> The reason is that the tablet performs the leader step-down too early and
> stays in the leader_transfer_in_progress_ state. The master then tries to
> send a change config request to add a peer, but the tablet server refuses it
> because of the leader_transfer_in_progress_ state.
>
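The guard proposed in the description (check the voter count before sending LeaderStepDown) might look roughly like the sketch below. It is not the attached patch or Kudu's actual code: `Replica`, `TabletConfig`, `CountVoters`, and `ShouldRequestLeaderStepDown` are hypothetical, simplified stand-ins used only to illustrate the condition.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for a tablet's Raft configuration.
// These are NOT Kudu's real types; they exist only for this illustration.
enum class MemberType { VOTER, NON_VOTER };

struct Replica {
  std::string ts_uuid;
  MemberType type;
};

struct TabletConfig {
  std::vector<Replica> replicas;
  std::string leader_uuid;
};

// Counts the voter replicas in the tablet's configuration.
int CountVoters(const TabletConfig& config) {
  int voters = 0;
  for (const auto& r : config.replicas) {
    if (r.type == MemberType::VOTER) {
      ++voters;
    }
  }
  return voters;
}

// Returns true only when the replica to remove is the current leader AND
// at least one other voter exists to take over leadership. With a single
// voter (replication factor 1, replacement not yet promoted), a step-down
// request has no target and should be postponed.
bool ShouldRequestLeaderStepDown(const TabletConfig& config,
                                 const std::string& uuid_to_remove) {
  if (config.leader_uuid != uuid_to_remove) {
    return false;  // The replica being removed is not the leader.
  }
  return CountVoters(config) > 1;
}
```

With this check, a tablet with replication factor 1 would skip the step-down until the replacement replica has been added and promoted to voter, so the change config request is not rejected because of a pending leadership transfer.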