[
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650674#comment-14650674
]
Ramkumar Aiyengar commented on SOLR-7819:
-----------------------------------------
A couple of comments; looks sensible overall.
{code}
log.info("Node " + replicaNodeName +
    " is not live, so skipping leader-initiated recovery for replica: core={} coreNodeName={}",
    replicaCoreName, replicaCoreNodeName);
// publishDownState will be false to avoid publishing the "down" state too many times
// as many errors can occur together and will each call into this method (SOLR-6189)
{code}
It still goes ahead with `publishDownState` if `forcePublishState` is true;
is that intentional? The caller does check whether the replica is live, but
there could be a race. Similarly, if our state is suspect due to zk
disconnect/session (the block before this), should the force be respected?
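As a rough illustration of the question only: the sketch below contrasts the behaviour described above with one possible stricter variant. The class and method names are hypothetical and the logic is reduced to three booleans; it is not the actual ZkController code, and the stricter variant is an assumption, not something taken from the patch.

```java
// Hypothetical sketch: names and structure are illustrative, not the actual ZkController code.
final class PublishDownSketch {

    // Behaviour as described above: the force flag alone is enough to publish
    // "down", even when the node was found not to be live (nodeIsLive is ignored).
    static boolean asDescribed(boolean nodeIsLive, boolean publishDownState, boolean forcePublishState) {
        return publishDownState || forcePublishState;
    }

    // One possible alternative (an assumption): a node that is not live never
    // gets a "down" publish, whether forced or not.
    static boolean liveOnly(boolean nodeIsLive, boolean publishDownState, boolean forcePublishState) {
        return nodeIsLive && (publishDownState || forcePublishState);
    }

    public static void main(String[] args) {
        // The replica's node left live_nodes after the caller's check (the race in question).
        System.out.println(asDescribed(false, false, true)); // true
        System.out.println(liveOnly(false, false, true));    // false
    }
}
```

Whether the stricter variant is desirable depends on why the caller forces the publish in the first place, which is the open question here.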
{code}
// if the replica's state is not DOWN right now, make it so ...
// we only really need to try to send the recovery command if the node itself is "live"
if (getZkStateReader().getClusterState().liveNodesContain(replicaNodeName)) {
  LeaderInitiatedRecoveryThread lirThread =
{code}
The comment no longer makes sense now that the code has moved to LIRT (LeaderInitiatedRecoveryThread).
> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.2, 5.2.1
> Reporter: Shalin Shekhar Mangar
> Labels: Jepsen
> Fix For: 5.3, Trunk
>
> Attachments: SOLR-7819.patch, SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads
> do not hang during a partition on ZK operations. However, some of those
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed
> to put a leader into a 'down' state (I'm still investigating and will open a
> separate issue about this problem).