[ 
https://issues.apache.org/jira/browse/KUDU-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong resolved KUDU-2548.
-------------------------------
    Fix Version/s: 1.10.0
       Resolution: Fixed

Users can run the rebalancer with permanently dead tservers by passing the 
`--ignore_tservers` option.

> Rebalancer tool should be able to run even if there are permanently dead 
> tablet servers
> ---------------------------------------------------------------------------------------
>
>                 Key: KUDU-2548
>                 URL: https://issues.apache.org/jira/browse/KUDU-2548
>             Project: Kudu
>          Issue Type: Improvement
>    Affects Versions: 1.7.1
>            Reporter: William Berkeley
>            Assignee: William Berkeley
>            Priority: Major
>             Fix For: 1.10.0
>
>
> The rebalancer will bail as soon as it sees a down tablet server, including 
> at the beginning before it does any rebalancing. There are a few reasons for this:
> 1. Rebalancing shouldn't fight with re-replication. If a tablet server is 
> down for a while, all its replicas will need to be re-replicated. Since 
> rebalancing is greedy and can be interrupted or resumed anytime, it's better 
> to exit, allow re-replication to occur, and then resume rebalancing.
> 2. It's more complicated to figure out how to balance correctly with a greedy 
> algorithm if tablet servers can come and go, since coming and going changes 
> the balance state of the cluster. We allow TS to join the cluster and will 
> begin to move replicas there, but if we allow TS to go down we ought to think 
> about how to handle them coming back. It's easier to leave solving this problem 
> for when rebalancing and re-replication are somewhat unified in the master.
> Nevertheless, it's a bummer that if, e.g., a user decom'd a tserver 3 months 
> ago, the rebalancer won't run because the rebalancer's ksck says a tserver is 
> unavailable. We can fix this very cleanly once proper decommissioning has 
> been implemented: with a distinction between "gone missing" and 
> "decommissioned", we can have the RB tool (really ksck) ignore decom'd 
> servers.
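
The behavior described above (abort on any unavailable tablet server, unless 
that server has been explicitly marked as ignored) can be illustrated with a 
small sketch. This is not Kudu's implementation: `TabletServer`, 
`preflight_check`, and the UUID strings are hypothetical names chosen for 
illustration, and the real tool relies on ksck for its health check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TabletServer:
    """Hypothetical stand-in for a tablet server's health report."""
    uuid: str
    healthy: bool


def preflight_check(tservers, ignored_uuids=frozenset()):
    """Return the servers eligible for rebalancing.

    Raises RuntimeError if any server that is not explicitly ignored is
    unavailable, mirroring the "bail on a down tablet server" rule.
    Ignored servers are skipped by the check and excluded from the result.
    """
    ignored = set(ignored_uuids)
    down = [ts.uuid for ts in tservers
            if not ts.healthy and ts.uuid not in ignored]
    if down:
        raise RuntimeError("unavailable tablet servers: " + ", ".join(down))
    return [ts for ts in tservers if ts.uuid not in ignored]
```

With a cluster containing one long-dead server, calling `preflight_check` 
without an ignore set raises, while passing that server's UUID in 
`ignored_uuids` lets the check pass and returns only the remaining servers.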



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
