[ 
https://issues.apache.org/jira/browse/KUDU-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2800:
--------------------------------
    Labels: newbie++  (was: )

> Avoid 'unintended' re-replication of long-bootstrapping tablet replicas
> -----------------------------------------------------------------------
>
>                 Key: KUDU-2800
>                 URL: https://issues.apache.org/jira/browse/KUDU-2800
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, tserver
>    Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.9.1, 1.10.0
>            Reporter: Alexey Serbin
>            Priority: Major
>              Labels: newbie++
>
> The logic for tracking the 'health' of tablet replicas cannot differentiate 
> between bootstrapping and failed replicas. See the implementation at
> https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576
> As a result, if a tablet replica takes longer to bootstrap than the 
> interval specified by the {{--follower_unavailable_considered_failed_sec}} 
> run-time flag, the system can start re-replicating the tablet replica 
> elsewhere.
> One option might be for a bootstrapping replica to return a special 
> {{PeerStatus}} in its response to a Raft message sent by the leader replica, 
> and to update the logic referenced above accordingly.  The response might 
> also include additional information on the current progress of the bootstrap 
> process.  We probably also need to add a separate timeout to detect a stalled 
> bootstrapping replica, so that its health is reported as FAILED once the 
> leader observes the replica stuck in bootstrapping with no forward progress 
> for longer than the timeout specified by the new parameter.
> However, the approach above requires the Raft consensus object for a 
> bootstrapping replica to be at least partially functional, so it entails 
> reading at least some information about a replica from the on-disk consensus 
> metadata prior to proper bootstrapping of a tablet replica by a tablet server.
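
The proposal above could be sketched roughly as follows. This is only a
language-neutral illustration in Python, not Kudu code: the names Health,
PeerState, on_response, peer_health and bootstrap_stall_timeout_sec are
hypothetical, as is the idea of reporting progress as a single integer. Only
the "unavailable" timeout corresponds to an existing flag
(--follower_unavailable_considered_failed_sec).

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    # Simplified, hypothetical health states for this sketch.
    HEALTHY = "HEALTHY"
    FAILED = "FAILED"


@dataclass
class PeerState:
    """Leader-side bookkeeping for one follower (hypothetical structure)."""
    last_contact_sec: float          # last successful (non-bootstrapping) exchange
    bootstrapping: bool = False      # peer last answered with a "bootstrapping" status
    progress: int = 0                # last bootstrap progress value reported by the peer
    last_progress_sec: float = 0.0   # when that progress value last advanced


def on_response(peer: PeerState, now: float, bootstrapping: bool,
                progress: int = 0) -> None:
    """Record a response from the peer as seen by the leader."""
    if bootstrapping:
        # Restart the stall clock only when bootstrapping begins or advances.
        if not peer.bootstrapping or progress > peer.progress:
            peer.progress = progress
            peer.last_progress_sec = now
        peer.bootstrapping = True
    else:
        peer.bootstrapping = False
        peer.last_contact_sec = now


def peer_health(peer: PeerState, now: float,
                unavailable_timeout_sec: float,
                bootstrap_stall_timeout_sec: float) -> Health:
    """Decide health, treating a bootstrapping peer separately from a silent one."""
    if peer.bootstrapping:
        # A bootstrapping replica fails only if it makes no forward progress
        # for longer than the (hypothetical) separate stall timeout.
        stalled = now - peer.last_progress_sec > bootstrap_stall_timeout_sec
        return Health.FAILED if stalled else Health.HEALTHY
    if now - peer.last_contact_sec > unavailable_timeout_sec:
        return Health.FAILED
    return Health.HEALTHY


# Illustrative timeline with made-up timeout values (seconds).
peer = PeerState(last_contact_sec=0.0)
on_response(peer, now=10.0, bootstrapping=True, progress=1)
# 400s after last contact, but still bootstrapping and not yet stalled.
print(peer_health(peer, 400.0, 300.0, 600.0))
```

With this shape, a replica that keeps reporting bootstrap progress is not
evicted merely for exceeding the "unavailable" interval, while one that stops
advancing (or stops responding while bootstrapping) is still reported as
FAILED once the stall timeout elapses.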



--
This message was sent by Atlassian Jira
(v8.3.2#803003)