[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646066#comment-14646066 ]

Shalin Shekhar Mangar commented on SOLR-7819:
---------------------------------------------

Hmm, this last patch isn't quite right because it can create multiple LIR 
threads for the same replica on connection loss.

For example, I found the following in the logs on one of the nodes, where 4 LIR 
threads were created to ask the same replica to recover (a sketch of the kind of 
per-replica guard I have in mind follows the log excerpt):
{code}
2015-07-29 13:21:24.629 INFO  
(updateExecutor-2-thread-18-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread 
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully 
after running for 0 secs
2015-07-29 13:21:24.978 INFO  
(updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.978 WARN  
(updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to 
downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on 
n1:8983_solr because core_node1 is no longer the leader! New leader is 
core_node2
2015-07-29 13:21:24.978 INFO  
(updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread 
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully 
after running for 39 secs
2015-07-29 13:21:24.979 INFO  
(updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.979 WARN  
(updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to 
downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on 
n1:8983_solr because core_node1 is no longer the leader! New leader is 
core_node2
2015-07-29 13:21:24.979 INFO  
(updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread 
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully 
after running for 28 secs
2015-07-29 13:21:24.981 INFO  
(updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.981 WARN  
(updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to 
downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on 
n1:8983_solr because core_node1 is no longer the leader! New leader is 
core_node2
2015-07-29 13:21:24.981 INFO  
(updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] 
o.a.s.c.LeaderInitiatedRecoveryThread 
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully 
after running for 33 secs
{code}
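
To make the idea concrete, here is a minimal sketch of the kind of per-replica guard 
that would prevent this, assuming we track which replicas already have an LIR thread 
running and only spawn a thread for the first caller. All names here (LirGuard, 
replicasInLIR, maybeStartLir, updateExecutor) are illustrative, not the actual 
ZkController fields or methods:

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only -- not the real ZkController code.
class LirGuard {
  // replicas (by core node name) that currently have an LIR thread running
  private final ConcurrentHashMap<String, Boolean> replicasInLIR = new ConcurrentHashMap<>();
  private final ExecutorService updateExecutor = Executors.newCachedThreadPool();

  /** Start an LIR task for the replica unless one is already running for it. */
  void maybeStartLir(final String coreNodeName, final Runnable lirTask) {
    // putIfAbsent returns null only for the first caller; concurrent retries
    // triggered by connection loss see the existing entry and back off.
    if (replicasInLIR.putIfAbsent(coreNodeName, Boolean.TRUE) != null) {
      return;
    }
    updateExecutor.execute(new Runnable() {
      @Override
      public void run() {
        try {
          lirTask.run();
        } finally {
          // allow a later, legitimate LIR attempt once this thread finishes
          replicasInLIR.remove(coreNodeName);
        }
      }
    });
  }
}
{code}

The remove() in the finally block is what lets a genuine follow-up LIR attempt go 
through later, while the putIfAbsent() check collapses a burst of retries like the 
one above into a single thread.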

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7819
>                 URL: https://issues.apache.org/jira/browse/SOLR-7819
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2, 5.2.1
>            Reporter: Shalin Shekhar Mangar
>              Labels: Jepsen
>             Fix For: 5.3, Trunk
>
>         Attachments: SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang during a partition on ZK operations. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).
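
For reference, this is roughly what "respecting retryOnConnLoss" means at the 
SolrZkClient level: the flag has to be passed through to the ZK calls so that an 
indexing thread hitting this path during a partition can fail fast instead of 
blocking on ZK retries. The helper below is hypothetical; only the SolrZkClient 
method signatures (exists/setData/makePath taking a trailing retryOnConnLoss 
boolean) reflect the real API:

{code}
import org.apache.solr.common.cloud.SolrZkClient;
import org.apache.zookeeper.KeeperException;

// Hypothetical helper -- only the SolrZkClient calls reflect the real API.
class LirStateWriter {
  private final SolrZkClient zkClient;

  LirStateWriter(SolrZkClient zkClient) {
    this.zkClient = zkClient;
  }

  /**
   * Writes the LIR state znode, threading retryOnConnLoss through so that
   * callers on the indexing path can pass false and fail fast on connection
   * loss instead of hanging while the ZK client retries.
   */
  void writeLirState(String znodePath, byte[] stateJson, boolean retryOnConnLoss)
      throws KeeperException, InterruptedException {
    if (zkClient.exists(znodePath, retryOnConnLoss)) {
      zkClient.setData(znodePath, stateJson, retryOnConnLoss);
    } else {
      zkClient.makePath(znodePath, stateJson, retryOnConnLoss);
    }
  }
}
{code}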


