[ https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876156#comment-14876156 ]
Ramkumar Aiyengar commented on SOLR-8069:
-----------------------------------------

The case we hit was when we cold stopped/started the cloud. This was on 4.10.4, so it may not be valid now. Let's say you have R1 and R2.

* R1 is the leader, and both R1 and R2 are stopped at the same time.
* R2 stops accepting requests but hasn't updated ZK yet; when R1 sends an update to R2, it fails and R1 puts R2 in LIR.
* R2 shuts down first, then R1.
* R1 starts up first and finds it should be the leader.
* R2 decides it should follow and tries to recover.
* R1 decides it can't be leader due to LIR and steps down. But by then R2 is in recovery and doesn't step up, so we have no one stepping forward.

> Leader Initiated Recovery can put the replica with the latest data into LIR
> and a shard will have no leader even on restart.
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-8069
>                 URL: https://issues.apache.org/jira/browse/SOLR-8069
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>         Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation
> where the rightful leader was put or put itself into LIR. Even on restart,
> this rightful leader won't take leadership and you have to manually clear the
> LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR
> should just be cleared.
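
The rule suggested at the end of the issue description (clear LIR when every replica takes part in the election on startup) can be sketched as a toy function. This is only an illustration and not Solr code: the function name, the parameter names, and the use of plain sets of replica names are all assumptions made for the example.

```python
def clear_lir_if_all_participating(expected_replicas, election_participants, lir_markers):
    """Return the LIR markers that remain after applying the proposed rule.

    Hypothetical model, not Solr's implementation:
    - expected_replicas: names of all replicas of the shard
    - election_participants: names of replicas that joined the election
    - lir_markers: names of replicas currently marked as being in LIR
    """
    # If every replica of the shard is taking part in the election,
    # no replica can hold newer data than the participants, so the
    # stale LIR markers are dropped.
    if set(expected_replicas) <= set(election_participants):
        return set()
    # Otherwise keep the markers: a missing replica might be the one
    # with the latest data.
    return set(lir_markers)


# Both replicas restarted and joined the election: R2's marker is cleared.
remaining = clear_lir_if_all_participating({"R1", "R2"}, {"R1", "R2"}, {"R2"})
print(remaining)
```

With only R1 in the election, the same call would leave R2's marker in place, which corresponds to the state that currently has to be cleared by hand.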