[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650674#comment-14650674
 ] 

Ramkumar Aiyengar commented on SOLR-7819:
-----------------------------------------

A couple of comments; looks sensible overall.

{code}
      log.info("Node " + replicaNodeName +
              " is not live, so skipping leader-initiated recovery for replica: core={} coreNodeName={}",
          replicaCoreName, replicaCoreNodeName);
      // publishDownState will be false to avoid publishing the "down" state too many times
      // as many errors can occur together and will each call into this method (SOLR-6189)
{code}

It still goes ahead with `publishDownState` if `forcePublishState` is true; 
is that intentional? The caller does check whether the replica is live, but 
there could be a race. Similarly, if our state is suspect due to a zk 
disconnect/session (the block before this), should the force be respected?
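
To make the question concrete, here is a simplified model of the two 
behaviours. It is only a sketch, not the patch's code: the class, the method 
names and the `alreadyInLir` flag are hypothetical, and the first method 
encodes my reading of the patch rather than its exact control flow.

```java
final class LirPublishSketch {

    // Reading of the patch as reviewed (an assumption, not the real code):
    // the liveness check only clears publishDownState, and the final
    // decision is "publishDownState OR forcePublishState".
    static boolean publishesAsReviewed(boolean replicaLive,
                                       boolean forcePublishState,
                                       boolean alreadyInLir) {
        if (alreadyInLir && !forcePublishState) {
            return false; // recovery already requested, nothing to do
        }
        boolean publishDownState = replicaLive;
        return publishDownState || forcePublishState;
    }

    // Hypothetical alternative: a dead node is never published as down.
    // The force flag only bypasses the "already requested" check.
    static boolean publishesIfLivenessWins(boolean replicaLive,
                                           boolean forcePublishState,
                                           boolean alreadyInLir) {
        if (!replicaLive) {
            return false;
        }
        return !alreadyInLir || forcePublishState;
    }

    public static void main(String[] args) {
        // The racy case: the caller saw the replica as live and set the
        // force flag, but the node dropped out before this method ran.
        System.out.println("as reviewed:   "
            + publishesAsReviewed(false, true, false));
        System.out.println("liveness wins: "
            + publishesIfLivenessWins(false, true, false));
    }
}
```

The two methods differ only when the node is not live and the force flag is 
set, which is exactly the race window mentioned above.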

{code}
      // if the replica's state is not DOWN right now, make it so ...
      // we only really need to try to send the recovery command if the node itself is "live"
      if (getZkStateReader().getClusterState().liveNodesContain(replicaNodeName)) {

        LeaderInitiatedRecoveryThread lirThread =
{code}

The comment no longer makes sense, as the code has moved to 
LeaderInitiatedRecoveryThread (LIRT).

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7819
>                 URL: https://issues.apache.org/jira/browse/SOLR-7819
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2, 5.2.1
>            Reporter: Shalin Shekhar Mangar
>              Labels: Jepsen
>             Fix For: 5.3, Trunk
>
>         Attachments: SOLR-7819.patch, SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang during a partition on ZK operations. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
