[ 
https://issues.apache.org/jira/browse/KUDU-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712794#comment-15712794
 ] 

Todd Lipcon commented on KUDU-1778:
-----------------------------------

Here are the interesting logs from the leader:
{code}
[LEADER]: Peer 275ece log is divergent from this leader: its last log entry 110.11279 is not in this leader's log and it has not received anything from this leader yet. Falling back to committed index 0
[LEADER]: Connected to new peer: Peer: 275ece, Is new: false, Last received: 0.0, Next index: 1, Last known committed idx: 0, Last exchange result: ERROR, Needs tablet copy: false
[LEADER]: Peer a1a2d4 log is divergent from this leader: its last log entry 111.11350 is not in this leader's log and it has not received anything from this leader yet. Falling back to committed index 0
[LEADER]: Connected to new peer: Peer: a1a2d4, Is new: false, Last received: 0.0, Next index: 1, Last known committed idx: 0, Last exchange result: ERROR, Needs tablet copy: false
{code}

On the followers I see reasonable status from their bootstrap logs:
{code}
I1201 03:03:39.364948 147662 tablet_bootstrap.cc:1019] T 07b3624f00864ab18f984364ed6e2d11 P 275ece6d98a14be9b7dfcee3bec8d7a8: ReplayState: Previous OpId: 110.11279, Committed OpId: 108.11277, Pending Replicates: 2, Pending Commits: 0

... on the other node:

I1201 03:05:52.102141 168075 tablet_bootstrap.cc:1019] T 07b3624f00864ab18f984364ed6e2d11 P a1a2d4b5585a4ac2a4d6e4d9a02fce6b: ReplayState: Previous OpId: 111.11350, Committed OpId: 111.11347, Pending Replicates: 3, Pending Commits: 0
{code}

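Both followers bootstrapped with non-zero committed indexes (11277 and 11347), 
yet the leader recorded "Last known committed idx: 0" for each. It seems like 
perhaps the follower didn't send its committed index properly in the LMP (Log 
Matching Property) mismatch error?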
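
To make the hypothesis concrete, here is a minimal sketch of the fallback a 
leader would be expected to perform. It is not Kudu's actual code: it assumes 
OpIds print as "term.index", and the names parse_opid and 
next_index_after_mismatch are hypothetical. It only shows that a committed 
index reported as 0 yields the "Next index: 1" seen in the leader logs, while 
the value from the follower's bootstrap log would yield a much later index.

```python
from typing import NamedTuple


class OpId(NamedTuple):
    term: int
    index: int


def parse_opid(text: str) -> OpId:
    """Parse an OpId printed as 'term.index' (assumed format, e.g. '108.11277')."""
    term, index = text.split(".")
    return OpId(int(term), int(index))


def next_index_after_mismatch(reported_committed_index: int) -> int:
    """Hypothetical leader-side fallback: resume replication at the entry
    right after the committed index the follower reported."""
    return reported_committed_index + 1


# Committed OpId taken from the first follower's bootstrap log above.
follower_committed = parse_opid("108.11277")

# What the leader would pick if the follower's committed index arrived intact.
expected_next = next_index_after_mismatch(follower_committed.index)

# What the leader picks if the committed index arrives as 0.
observed_next = next_index_after_mismatch(0)

print(expected_next, observed_next)
```

If the committed index really is lost in the error response, the second case 
is the one the leader would hit for both peers.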

> Consensus "stuck" after a leader election when both peers were divergent
> ------------------------------------------------------------------------
>
>                 Key: KUDU-1778
>                 URL: https://issues.apache.org/jira/browse/KUDU-1778
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> On a stress cluster we saw the following sequence of events following a 
> service restart while under load:
> - a peer is elected leader successfully
> - both of its followers have divergent logs
> - when it connects to a new peer with a divergent log, it decides to fall 
> back to index 0 rather than falling back to the proper committed index of 
> that peer
> - after falling back to index 0, replication will never succeed, since the 
> first segment of the log was GCed long ago.
> Thus, the leader thinks that it needs to evict both of the followers and 
> can't replicate to them, and the tablet gets "stuck".
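
The last step of the sequence above can be illustrated with a small sketch. 
This is an assumption-laden model, not Kudu's log implementation: the function 
can_serve_from and the value FIRST_RETAINED_INDEX are hypothetical, and the 
real first retained index on the stress cluster is not given in this issue.

```python
# Hypothetical first index still present in the leader's log after GC.
# The real value is not stated in the issue; any value > 1 shows the problem.
FIRST_RETAINED_INDEX = 10000


def can_serve_from(next_index: int, first_retained_index: int) -> bool:
    """Return True if the leader still holds the entry at next_index,
    i.e. the entry has not been garbage-collected."""
    return next_index >= first_retained_index


# Falling back to committed index 0 means starting from entry 1, which is gone.
print(can_serve_from(1, FIRST_RETAINED_INDEX))

# Starting just after a follower's real committed index (11277) would work
# under this assumed retention point.
print(can_serve_from(11278, FIRST_RETAINED_INDEX))
```

Under these assumptions the request for entry 1 can never be satisfied, which 
matches the "stuck" behavior described above.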



