[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-7819:
----------------------------------------
    Attachment: SOLR-7819.patch

Here's a patch which:
# Adds retryOnConnLoss in ZkController's 
ensureReplicaInLeaderInitiatedRecovery, updateLeaderInitiatedRecoveryState and 
markShardAsDownIfLeader methods.
# Starts a LIR thread if the leader cannot mark the replica as down because of 
a connection loss. Previously, both session loss and connection loss would 
skip starting the LIR thread.

I'm still running Solr's integration and jepsen tests.

This causes a subtle change in behavior which is best analyzed with two 
different scenarios:
# Leader fails to send an update to replica but also suffers a temporary blip 
in its ZK connection during the DistributedUpdateProcessor's doFinish method
## Currently, a few indexing threads will hang but eventually succeed in 
marking the 'replica' as down and the leader will start a new LIR thread to ask 
the replica to recover.
## With this patch, the indexing threads do not hang; instead, a connection 
loss exception is thrown. At this point, we start a new LIR thread to ask the 
replica to recover. Although this removes the safety of explicitly marking the 
'replica' as down, the LIR thread does provide a timeout-based safety net that 
makes sure the replica recovers from the leader.
# Leader fails to send an update to replica but also suffers a long network 
partition between itself and the ZK server during the DUP.doFinish method.
## Currently, a few indexing threads will hang in 
ZkController.ensureReplicaInLeaderInitiatedRecovery until the ZK operations 
time out because of connection loss or session loss, and no LIR thread will be 
created. This seems okay because the current connection loss timeout value is 
higher than the ZK session expiration time, and session loss means that ZK has 
already determined that our session has expired. In both cases, a new leader 
election should have happened and there's no need to mark the replica as 
'down'.
## With this patch, the difference is that the indexing threads do not hang and 
ensureReplicaInLeaderInitiatedRecovery fails immediately with a connection 
loss exception. A new LIR thread *is* started in this scenario. This is also 
fine because we were not able to mark the replica as 'down' and we aren't sure 
that the session has expired, so it is important that we start the LIR thread 
to ask the replica to recover. Even if a new leader has been elected, there's 
no major harm done by asking the replica to recover.
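
The two scenarios above boil down to one decision: given how the ZK write of 
the 'down' state ended, does the leader start a LIR thread? The following is 
only a rough sketch of that decision, not the code in the patch. Every name in 
it (LirDecisionSketch, ZkWriteOutcome, Decision, beforePatch, withPatch) is 
hypothetical and none of them are Solr APIs. It also assumes that a session 
expiry still skips the LIR thread after the patch, which is my reading of the 
description rather than something stated outright.

```java
// Hypothetical model of the behavior change described above; not Solr code.
final class LirDecisionSketch {

    /** Final outcome of the leader's attempt to write the replica's 'down' state to ZK. */
    enum ZkWriteOutcome { SUCCESS, CONNECTION_LOSS, SESSION_EXPIRED }

    /** What the leader ends up doing for the replica that missed the update. */
    static final class Decision {
        final boolean replicaMarkedDown;
        final boolean startLirThread;

        Decision(boolean replicaMarkedDown, boolean startLirThread) {
            this.replicaMarkedDown = replicaMarkedDown;
            this.startLirThread = startLirThread;
        }

        @Override
        public String toString() {
            return "markedDown=" + replicaMarkedDown + ", startLirThread=" + startLirThread;
        }
    }

    /** Before the patch: the LIR thread is started only if the 'down' state was written. */
    static Decision beforePatch(ZkWriteOutcome outcome) {
        boolean marked = outcome == ZkWriteOutcome.SUCCESS;
        return new Decision(marked, marked);
    }

    /** With the patch: a connection loss no longer prevents starting the LIR thread. */
    static Decision withPatch(ZkWriteOutcome outcome) {
        switch (outcome) {
            case SUCCESS:
                return new Decision(true, true);
            case CONNECTION_LOSS:
                // The replica could not be marked 'down', so rely on the LIR thread.
                return new Decision(false, true);
            default:
                // Assumption: session expiry still skips the LIR thread.
                return new Decision(false, false);
        }
    }

    public static void main(String[] args) {
        for (ZkWriteOutcome outcome : ZkWriteOutcome.values()) {
            System.out.println(outcome + " before:  " + beforePatch(outcome));
            System.out.println(outcome + " patched: " + withPatch(outcome));
        }
    }
}
```

In this model the only row that differs between beforePatch and withPatch is 
CONNECTION_LOSS, which matches the "subtle change in behavior" discussed in 
both scenarios.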

So, net-net this patch doesn't seem to introduce any new problems in the system.

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7819
>                 URL: https://issues.apache.org/jira/browse/SOLR-7819
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2, 5.2.1
>            Reporter: Shalin Shekhar Mangar
>              Labels: Jepsen
>             Fix For: 5.3, Trunk
>
>         Attachments: SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang during a partition on ZK operations. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).



