Todd Lipcon commented on KUDU-1778:

I actually noticed this in the follower logs:

I1201 03:05:52.344377 168075 consensus_queue.cc:167] T 
07b3624f00864ab18f984364ed6e2d11 P a1a2d4b5585a4ac2a4d6e4d9a02fce6b 
[NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated index: 0, 
Majority replicated index: 0, Committed index: 0, Last appended: 111.11350, 
Current term: 0, Majority size: -1, State: 1, Mode: NON_LEADER

So I guess that if a replica starts up and has a divergent log, it will return 
committed_index=0 unless it gets elected leader first?

> Consensus "stuck" after a leader election when both peers were divergent
> ------------------------------------------------------------------------
>                 Key: KUDU-1778
>                 URL: https://issues.apache.org/jira/browse/KUDU-1778
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> On a stress cluster we saw the following sequence of events following a 
> service restart while under load:
> - a peer is elected leader successfully
> - both of its followers have divergent logs
> - when it connects to a new peer with a divergent log, it decides to fall 
> back to index 0 rather than falling back to the proper committed index of 
> that peer
> - upon falling back to index 0, replication will never succeed, since the 
> first segment of the log was already GCed long ago.
> Thus, the leader thinks that it needs to evict both of the followers and 
> can't replicate to them, and the tablet gets "stuck".
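
The sequence described above can be illustrated with a minimal sketch. This is
not Kudu's actual code: LeaderLog, next_index_after_mismatch, can_replicate,
and the index values 9000 and 11300 are all hypothetical, and the "resume at
committed index + 1" rule is a simplifying assumption. Only the last appended
index, 11350, is taken from the log line in the comment. The sketch shows why
a follower reporting committed_index=0 leaves the leader unable to make
progress once the early log segments have been GCed.

```python
from dataclasses import dataclass


@dataclass
class LeaderLog:
    """Hypothetical view of the leader's log after old segments were GCed."""
    first_retained_index: int  # oldest op index still on disk (assumed value)
    last_index: int            # newest op index in the log

    def can_read_from(self, index: int) -> bool:
        # Ops below first_retained_index were garbage collected and
        # can no longer be re-sent to a follower.
        return self.first_retained_index <= index <= self.last_index


def next_index_after_mismatch(reported_committed_index: int) -> int:
    """Assumed rule: resume just after the index the follower reports."""
    return reported_committed_index + 1


def can_replicate(log: LeaderLog, reported_committed_index: int) -> bool:
    """True if the leader still holds the ops it would need to send."""
    return log.can_read_from(next_index_after_mismatch(reported_committed_index))


log = LeaderLog(first_retained_index=9000, last_index=11350)

# Follower with a divergent log reports committed_index=0: the leader falls
# back to index 1, which was GCed long ago, so replication cannot proceed.
print(can_replicate(log, 0))      # False

# If the follower reported its real committed index (hypothetically 11300),
# the leader could resume from ops it still has.
print(can_replicate(log, 11300))  # True
```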

