Todd Lipcon created KUDU-1778:

             Summary: Consensus "stuck" after a leader election when both peers were divergent
                 Key: KUDU-1778
             Project: Kudu
          Issue Type: Bug
          Components: consensus
    Affects Versions: 1.1.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon
            Priority: Critical

On a stress cluster we saw this sequence of events after a service restart 
while under load:
- a peer is elected leader successfully
- both of its followers have divergent logs
- when the leader connects to a new peer with a divergent log, it decides to 
fall back to index 0 rather than to the proper committed index of that peer
- having fallen back to index 0, the leader will never succeed, since the first 
segment of the log was already GCed long ago.

Thus, the leader can't replicate to either follower, thinks that it needs to 
evict both of them, and the tablet gets "stuck".
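
The failure mode described above can be illustrated with a small sketch. This 
is not Kudu's consensus code: LeaderLog, next_index_after_divergence, and 
can_replicate are hypothetical names, and the index values are made up. It 
only assumes that the leader can send an operation if that operation is still 
present in its (partially GCed) log.

```python
from dataclasses import dataclass


@dataclass
class LeaderLog:
    """Hypothetical view of the leader's log after old segments were GCed."""
    first_retained_index: int  # oldest op index still on disk
    last_index: int            # newest op index in the log


def next_index_after_divergence(peer_committed_index: int,
                                fall_back_to_committed: bool) -> int:
    """Pick the first op to resend to a peer whose log has diverged.

    fall_back_to_committed=False models the behavior reported here
    (restart from index 0); True models falling back to the peer's
    committed index instead.
    """
    base = peer_committed_index if fall_back_to_committed else 0
    return base + 1


def can_replicate(log: LeaderLog, next_index: int) -> bool:
    """The leader can only send ops that it still has in its log."""
    return log.first_retained_index <= next_index <= log.last_index + 1


# Illustrative numbers only: ops before 5000 were GCed long ago.
log = LeaderLog(first_retained_index=5000, last_index=9000)
peer_committed = 8500

reported = next_index_after_divergence(peer_committed, False)
expected = next_index_after_divergence(peer_committed, True)

print(reported, can_replicate(log, reported))
print(expected, can_replicate(log, expected))
```

Under these assumptions, restarting from index 0 asks for an operation that 
no longer exists on the leader, so every retry fails in the same way, while 
restarting from the peer's committed index stays inside the retained range.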
