[jira] [Updated] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

Shalin Shekhar Mangar (JIRA) Fri, 31 Jul 2015 12:17:24 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shalin Shekhar Mangar updated SOLR-7819:
----------------------------------------
    Attachment: SOLR-7819.patch

This patch moves all leader-initiated recovery (LIR) activity into the LIR 
thread. The LIR thread now publishes the LIR state, publishes the node state, 
and then starts a recovery loop depending on whether the LIR state was 
published successfully or failed because of session expiry or connection loss. 
The indexing thread only consults the local replica map to ensure that only one 
LIR thread is started for any given replica, so it never needs to wait for the 
ZK operations needed for LIR. All tests pass except for 
HttpPartitionTest.testLeaderInitiatedRecoveryCRUD, whose assumptions about the 
LIR workflow are no longer correct.
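
To illustrate the "only one LIR thread per replica" guard described above, here 
is a minimal sketch. It is not the code in the patch: the class LirGuard, its 
methods, and the use of a plain Executor are hypothetical names chosen for this 
example, and the lirWork argument merely stands in for the ZK publishing and 
recovery loop. The point is that the indexing thread only touches an in-memory 
set and returns immediately.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.RejectedExecutionException;

/** Hypothetical sketch; these names are assumptions, not Solr's actual API. */
final class LirGuard {
    // Replicas that currently have LIR work in flight, keyed by replica URL.
    private final Set<String> replicasInLir = ConcurrentHashMap.newKeySet();
    private final Executor lirExecutor;

    LirGuard(Executor lirExecutor) {
        this.lirExecutor = lirExecutor;
    }

    /**
     * Called from the indexing thread. Performs no ZK operations itself.
     * Returns true if this call started new LIR work for the replica and
     * false if LIR work for that replica is already in flight.
     */
    boolean ensureLirStarted(String replicaUrl, Runnable lirWork) {
        // Set.add is atomic here, so only one caller wins for a given replica.
        if (!replicasInLir.add(replicaUrl)) {
            return false;
        }
        try {
            lirExecutor.execute(() -> {
                try {
                    // Stand-in for: publish LIR state, publish node state,
                    // then run the recovery loop.
                    lirWork.run();
                } finally {
                    // Allow a later LIR attempt once this one has finished.
                    replicasInLir.remove(replicaUrl);
                }
            });
        } catch (RejectedExecutionException e) {
            // The work never started, so do not leave a stale entry behind.
            replicasInLir.remove(replicaUrl);
            throw e;
        }
        return true;
    }

    boolean isInLir(String replicaUrl) {
        return replicasInLir.contains(replicaUrl);
    }
}
```

With a thread pool as the executor, ensureLirStarted returns as soon as the 
task is queued, so a slow or partitioned ZK only delays the LIR task and not 
the caller. A second call for the same replica while the first task is still 
running returns false without starting another task.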

Still running more tests.

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7819
>                 URL: https://issues.apache.org/jira/browse/SOLR-7819
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2, 5.2.1
>            Reporter: Shalin Shekhar Mangar
>              Labels: Jepsen
>             Fix For: 5.3, Trunk
>
>         Attachments: SOLR-7819.patch, SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang on ZK operations during a partition. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).



