[ https://issues.apache.org/jira/browse/KUDU-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated KUDU-1778:
------------------------------
Status: In Review (was: Open)
> Consensus "stuck" after a leader election when both peers were divergent
> ------------------------------------------------------------------------
>
> Key: KUDU-1778
> URL: https://issues.apache.org/jira/browse/KUDU-1778
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 1.1.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> On a stress cluster we saw the following sequence of events following a
> service restart while under load:
> - a peer is elected leader successfully
> - both of its followers have divergent logs
> - when it connects to a new peer with a divergent log, it decides to fall
> back to index 0 rather than falling back to the proper committed index of
> that peer
> - upon falling back to index 0, it will never succeed since the first segment of
> the log was already GCed long ago.
> Thus, the leader thinks that it needs to evict both of the followers and
> can't replicate to them, and the tablet gets "stuck".
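
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Kudu's actual consensus code or the patch under review: the names LeaderLog, can_read_from, and fallback_index are hypothetical, and the index values are made up for the example. It shows why resuming from index 0 cannot work once the early log segments have been GCed, while resuming from the follower's committed index can.

```python
from dataclasses import dataclass


@dataclass
class LeaderLog:
    """Toy model of the leader's on-disk log (hypothetical, not a Kudu class)."""
    first_retained_index: int  # oldest op still on disk; earlier segments were GCed
    last_index: int            # newest op in the leader's log

    def can_read_from(self, index: int) -> bool:
        # The leader can only send ops that it still has on disk.
        return self.first_retained_index <= index <= self.last_index


def fallback_index(follower_committed_index: int, use_committed_index: bool) -> int:
    """Pick the index to resume from after a follower reports a divergent log.

    use_committed_index=False models the behavior described in the report
    (fall back to index 0); True models falling back to the follower's
    committed index instead.
    """
    if use_committed_index:
        return follower_committed_index
    return 0


# Assumed example values: the leader has GCed everything before index 5000.
leader = LeaderLog(first_retained_index=5000, last_index=8000)
follower_committed = 7200

reported = fallback_index(follower_committed, use_committed_index=False)
expected = fallback_index(follower_committed, use_committed_index=True)

print(reported, leader.can_read_from(reported))
print(expected, leader.can_read_from(expected))
```

In this model the fallback to index 0 is never readable, because that part of the log no longer exists on the leader, so every retry fails the same way. A fallback to the follower's committed index lands inside the retained range, so replication can resume.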
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)