Todd Lipcon updated KUDU-1778:
    Status: In Review  (was: Open)

> Consensus "stuck" after a leader election when both peers were divergent
> ------------------------------------------------------------------------
>                 Key: KUDU-1778
>                 URL: https://issues.apache.org/jira/browse/KUDU-1778
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
> On a stress cluster we saw the following sequence of events after a 
> service restart under load:
> - a peer is elected leader successfully
> - both of its followers have divergent logs
> - when it connects to a new peer with a divergent log, it decides to fall 
> back to index 0 rather than falling back to the proper committed index of 
> that peer
> - after falling back to index 0, the leader can never succeed in catching 
> the peer up, since the first segment of its log was garbage-collected (GCed) 
> long ago.
> Thus, the leader thinks that it needs to evict both of the followers and 
> can't replicate to them, and the tablet gets "stuck".
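
The fallback the report argues for can be sketched as below. This is a minimal Python sketch under a simplified Raft-style model, not Kudu's actual (C++) consensus code: `MismatchResponse`, `resume_index`, `resume_index_buggy` and `first_retained_index` are hypothetical names chosen for illustration. The idea is that entries up to a follower's committed index must already match the leader's log, so the leader can resume just past that index instead of rewinding to 0. Resuming only works if the leader still retains that part of its log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MismatchResponse:
    """Hypothetical summary of a follower's reply when its log diverges."""
    last_received_index: int  # last entry in the follower's (divergent) log
    committed_index: int      # highest index the follower knows is committed


def resume_index(resp: MismatchResponse, first_retained_index: int) -> Optional[int]:
    """Index the leader should resume sending from, or None if it cannot.

    Committed entries cannot diverge, so resuming just past the follower's
    committed index skips its divergent suffix without rewinding to 0.
    (Simplified: ignores the term check on the preceding entry.)
    """
    candidate = resp.committed_index + 1
    if candidate < first_retained_index:
        # The leader has already GCed the entries this follower needs, so
        # ordinary log replication cannot catch it up.
        return None
    return candidate


def resume_index_buggy(first_retained_index: int) -> Optional[int]:
    """The behavior described in the report: always rewind to index 0."""
    candidate = 0
    return candidate if candidate >= first_retained_index else None


if __name__ == "__main__":
    resp = MismatchResponse(last_received_index=1200, committed_index=1150)
    print(resume_index(resp, first_retained_index=1000))  # 1151
    print(resume_index_buggy(first_retained_index=1000))  # None: the "stuck" case
```

With the log retained from index 1000 onward, the committed-index fallback resumes at 1151, while the rewind-to-0 behavior finds nothing it can send. That mirrors how the leader in the report ends up unable to replicate to either follower.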

This message was sent by Atlassian JIRA
