[ 
https://issues.apache.org/jira/browse/SOLR-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-7109:
----------------------------------------
    Attachment: SOLR-7109.patch

Patch updated to trunk.

I've been testing with Jepsen on trunk with this patch and it has worked very 
well. Unfortunately, writing a JUnit test to simulate this failure is proving 
to be very difficult. I'm inclined to commit this as-is for now, so a code 
review would be appreciated.

Also note that this patch does not solve the problem of threads getting 
stuck. It only ensures that the LIR (leader-initiated recovery) state is 
written only if the current leader's ephemeral sequential election node 
still exists.
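
To illustrate the guard described above, here is a minimal in-memory sketch in 
Java. It is not the code from the attached patch and it does not use the 
ZooKeeper client. The class name LirGuardSketch, the method 
writeLirStateIfStillLeader, and all znode paths are hypothetical and exist 
only for this example. The idea it models is that the existence check on the 
writer's election node and the write of the LIR state happen as one atomic 
step, so a thread that wakes up after its leadership is gone cannot mark 
another replica as down.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a ZooKeeper-like store. Hypothetical, not Solr code. */
class LirGuardSketch {
    // Maps znode path to data; presence of a key means the znode exists.
    private final Map<String, String> znodes = new HashMap<>();

    synchronized void create(String path, String data) {
        znodes.put(path, data);
    }

    synchronized void delete(String path) {
        znodes.remove(path);
    }

    synchronized String read(String path) {
        return znodes.get(path);
    }

    /**
     * Writes the LIR state only if the writer's election node still exists.
     * The check and the write run under one lock, which models an atomic
     * check-then-write. Returns false and writes nothing if the node is gone.
     */
    synchronized boolean writeLirStateIfStillLeader(
            String electionNodePath, String lirPath, String state) {
        if (!znodes.containsKey(electionNodePath)) {
            return false; // leadership was lost while the thread was stuck
        }
        znodes.put(lirPath, state);
        return true;
    }

    public static void main(String[] args) {
        LirGuardSketch store = new LirGuardSketch();
        String oldLeaderNode = "/election/n_0000000001"; // assumed path
        String newLeaderNode = "/election/n_0000000002"; // assumed path

        // The old leader holds its ephemeral election node.
        store.create(oldLeaderNode, "core_node1");

        // Partition: the old leader's session expires, its node disappears,
        // and another replica wins the election.
        store.delete(oldLeaderNode);
        store.create(newLeaderNode, "core_node2");

        // A stuck thread on the old leader wakes up and tries to mark the
        // new leader as down. The guard rejects the stale write.
        boolean written = store.writeLirStateIfStillLeader(
                oldLeaderNode, "/lir/core_node2", "down");
        System.out.println("stale write applied: " + written);
        System.out.println("LIR state for core_node2: " + store.read("/lir/core_node2"));
    }
}
```

Against a real ZooKeeper ensemble the same effect would need a server-side 
atomic operation rather than a local lock. Whether and how the patch does 
that is best confirmed by reading the patch itself.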

> Indexing threads stuck during network partition can put leader into down state
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-7109
>                 URL: https://issues.apache.org/jira/browse/SOLR-7109
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.3, 5.0
>            Reporter: Shalin Shekhar Mangar
>             Fix For: Trunk, 5.1
>
>         Attachments: SOLR-7109.patch, SOLR-7109.patch
>
>
> I found this recently while running some Jepsen tests. Some threads get 
> stuck on ZooKeeper operations for a long time in the 
> ZkController.updateLeaderInitiatedRecoveryState method, and when they wake 
> up they go ahead and set the LIR state to down. In the meantime, a new 
> leader has been elected, and sometimes you get into a state where the 
> leader itself is put into recovery, causing the shard to reject all writes.



