[
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shalin Shekhar Mangar updated SOLR-7819:
----------------------------------------
Attachment: SOLR-7819.patch
This patch moves all LIR-related activity inside the LIR thread. The LIR thread
now publishes the LIR state, publishes the node state, and then starts a
recovery loop depending on whether the LIR state was published successfully or
failed because of session expiry or connection loss. The indexing thread only
consults the local replica map to ensure that only one LIR thread is started for
any given replica. This ensures that the indexing thread never has to wait for
the ZK operations needed for LIR. All tests pass except for
HttpPartitionTest.testLeaderInitiatedRecoveryCRUD, whose assumptions about the
LIR workflow are no longer correct.
Still running more tests.
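
To illustrate the "only one LIR thread per replica" guard described above, here
is a minimal sketch. It is not the code from the patch: the class name
LirThreadGuard, the method maybeStartLir, and the use of a ConcurrentHashMap as
the local replica map are all assumptions made for illustration. The point it
shows is that the indexing thread does a single in-memory check and returns
immediately, while anything that might block on ZK runs in the background thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; names do not come from Solr or the patch.
final class LirThreadGuard {
    // Stand-in for the "local replica map": replica URL -> "LIR in progress" marker.
    private final ConcurrentMap<String, Boolean> replicasInLir = new ConcurrentHashMap<>();

    /**
     * Called from the indexing thread. Performs no ZK operations itself.
     * Returns true if a new LIR thread was started, false if one is already
     * running for this replica.
     */
    boolean maybeStartLir(String replicaUrl, Runnable lirWork) {
        // putIfAbsent is atomic, so at most one caller wins for a given replica.
        if (replicasInLir.putIfAbsent(replicaUrl, Boolean.TRUE) != null) {
            return false;
        }
        Thread lirThread = new Thread(() -> {
            try {
                // Assumed to cover publishing LIR state, publishing node state,
                // and the recovery loop; all of it may block on ZK.
                lirWork.run();
            } finally {
                // Assumption: the entry is cleared when the LIR thread ends,
                // so a later failure can start a fresh LIR thread.
                replicasInLir.remove(replicaUrl);
            }
        }, "lir-" + replicaUrl);
        lirThread.setDaemon(true);
        lirThread.start();
        return true;
    }
}
```

With a guard like this, a second failed update to the same replica while LIR is
already in progress returns false right away instead of queuing another ZK
write from the indexing thread.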
> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.2, 5.2.1
> Reporter: Shalin Shekhar Mangar
> Labels: Jepsen
> Fix For: 5.3, Trunk
>
> Attachments: SOLR-7819.patch, SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads
> do not hang during a partition on ZK operations. However, some of those
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed
> to put a leader into a 'down' state (I'm still investigating and will open a
> separate issue about this problem).