[jira] [Updated] (SOLR-7989) Down replica elected leader, stays down after successful election

Ishan Chattopadhyaya (JIRA) Tue, 10 Nov 2015 21:48:31 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ishan Chattopadhyaya updated SOLR-7989:
---------------------------------------
    Attachment: SOLR-7569.patch

bq. Shouldn't we just publish active regardless?
That's what I wanted to do in my initial patch. However, after Noble's comment 
asking for the check, I thought it would save one overseer message and be 
more efficient.

bq. Why do we use the stale clusterstate to see if we are already active and 
prevent publishing active if we are not?
Which would you suggest: (1) force an update of the cluster state before the 
check, so that we don't check against a stale clusterstate, or (2) send the 
active state message regardless?

Attaching the patch for (1). This required a change to the LeaderElectionTest.

Doing (2) would require a change to OverseerTest.testOverseerStatsReset 
(SOLR-8249), and I don't currently know how to make that test work if the 
STATE=ACTIVE message is sent regardless. If that's the way you suggest we go, 
I could put up a patch that sends the message without a state check and 
disables the test for now.
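
To make the difference between (1) and (2) concrete, here is a toy sketch. 
It is not taken from the patch: Cluster, CachedView and both publish methods 
are hypothetical stand-ins, not Solr classes, and the overseer is reduced to 
a message counter. It only illustrates why checking a stale view can skip a 
needed ACTIVE message, and what each option costs.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names are Solr classes or methods.
class LeaderStateSketch {
    enum State { DOWN, ACTIVE }

    // Stand-in for the authoritative state kept in ZooKeeper.
    static final class Cluster {
        final Map<String, State> states = new HashMap<>();
        int overseerMessages = 0; // counts state-update messages sent

        void publish(String replica, State state) {
            overseerMessages++;
            states.put(replica, state);
        }
    }

    // Stand-in for a node's locally cached (possibly stale) view.
    static final class CachedView {
        private final Cluster cluster;
        private Map<String, State> snapshot;

        CachedView(Cluster cluster) {
            this.cluster = cluster;
            refresh();
        }

        void refresh() { snapshot = new HashMap<>(cluster.states); }

        State stateOf(String replica) { return snapshot.get(replica); }
    }

    // Option (1): refresh the cached view, then publish only if needed.
    static void publishActiveIfNeeded(Cluster cluster, CachedView view, String replica) {
        view.refresh();
        if (view.stateOf(replica) != State.ACTIVE) {
            cluster.publish(replica, State.ACTIVE);
        }
    }

    // Option (2): always publish, at the cost of a possibly redundant message.
    static void publishActiveAlways(Cluster cluster, String replica) {
        cluster.publish(replica, State.ACTIVE);
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.publish("notleader0", State.ACTIVE);
        CachedView view = new CachedView(cluster);  // cache says ACTIVE
        cluster.publish("notleader0", State.DOWN);  // cache is now stale

        // Without the refresh inside option (1), the stale ACTIVE entry
        // would suppress the publish and the replica would stay DOWN.
        publishActiveIfNeeded(cluster, view, "notleader0");
        System.out.println(cluster.states.get("notleader0")
                + " after " + cluster.overseerMessages + " messages");
    }
}
```

In this model, option (1) avoids a redundant message when the replica is 
already active, while option (2) needs no refresh but always sends one.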

> Down replica elected leader, stays down after successful election
> -----------------------------------------------------------------
>
>                 Key: SOLR-7989
>                 URL: https://issues.apache.org/jira/browse/SOLR-7989
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Ishan Chattopadhyaya
>            Assignee: Noble Paul
>             Fix For: 5.4, Trunk
>
>         Attachments: DownLeaderTest.java, DownLeaderTest.java, 
> SOLR-7569.patch, SOLR-7989.patch, SOLR-7989.patch, SOLR-7989.patch, 
> SOLR-7989.patch, SOLR-8233.patch
>
>
> It is possible that a down replica gets elected as a leader, and that it 
> stays down after the election.
> Here's how I hit upon this:
> * There are 3 replicas: leader, notleader0, notleader1
> * Introduce a network partition to isolate notleader0 and notleader1 from 
> leader (leader puts these two in LIR via zk).
> * Kill leader, remove partition. Now leader is dead, and both notleader0 
> and notleader1 are down. There is no leader.
> * Remove LIR znodes in zk.
> * Wait a while, and a (flawed?) leader election takes place.
> * Finally, one of notleader0 or notleader1 (both of which were down before) 
> becomes leader, but stays down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

