[ https://issues.apache.org/jira/browse/KUDU-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated KUDU-1778:
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.0
           Status: Resolved  (was: In Review)

> Consensus "stuck" after a leader election when both peers were divergent
> ------------------------------------------------------------------------
>
>                 Key: KUDU-1778
>                 URL: https://issues.apache.org/jira/browse/KUDU-1778
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.1.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>             Fix For: 1.2.0
>
>
> On a stress cluster we saw the following sequence of events following a
> service restart while under load:
> - a peer is elected leader successfully
> - both of its followers have divergent logs
> - when it connects to a new peer with a divergent log, it decides to fall
> back to index 0 rather than falling back to the proper committed index of
> that peer
> - upon falling back to index 0, it will never succeed since the first segment of
> the log was already GCed long ago.
> Thus, the leader thinks that it needs to evict both of the followers and
> can't replicate to them, and the tablet gets "stuck".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)