[
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646066#comment-14646066
]
Shalin Shekhar Mangar commented on SOLR-7819:
---------------------------------------------
Hmm, this last patch isn't quite right because it can create multiple LIR
threads for the same replica on connection loss.
For example, I found the following in the logs of one of the nodes, where four
LIR threads were created to ask the same replica to recover:
{code}
2015-07-29 13:21:24.629 INFO
(updateExecutor-2-thread-18-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.LeaderInitiatedRecoveryThread
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully
after running for 0 secs
2015-07-29 13:21:24.978 INFO
(updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.978 WARN
(updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to
downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on
n1:8983_solr because core_node1 is no longer the leader! New leader is
core_node2
2015-07-29 13:21:24.978 INFO
(updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.LeaderInitiatedRecoveryThread
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully
after running for 39 secs
2015-07-29 13:21:24.979 INFO
(updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.979 WARN
(updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to
downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on
n1:8983_solr because core_node1 is no longer the leader! New leader is
core_node2
2015-07-29 13:21:24.979 INFO
(updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.LeaderInitiatedRecoveryThread
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully
after running for 28 secs
2015-07-29 13:21:24.981 INFO
(updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.981 WARN
(updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to
downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on
n1:8983_solr because core_node1 is no longer the leader! New leader is
core_node2
2015-07-29 13:21:24.981 INFO
(updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1
http:////n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2
c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2]
o.a.s.c.LeaderInitiatedRecoveryThread
LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully
after running for 33 secs
{code}
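One way to avoid this would be to have the leader register the replica (for
example, by coreNodeName) before spawning an LIR thread and skip the spawn if a
thread is already registered for it. The following is only a minimal sketch of
that idea, not the actual ZkController/LeaderInitiatedRecoveryThread code; the
class name, the guard set, and the method signature are all hypothetical:
{code:java}
// Hypothetical sketch (not the actual ZkController code): guard LIR thread
// creation with a concurrent set keyed by coreNodeName, so that repeated
// connection-loss callbacks for the same replica do not spawn duplicate threads.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

public class LirThreadGuard {
  // Replicas that already have a live LIR thread asking them to recover.
  private final Set<String> lirInProgress = ConcurrentHashMap.newKeySet();

  public void maybeStartLirThread(ExecutorService updateExecutor,
                                  String coreNodeName,
                                  Runnable lirTask) {
    // add() returns false if another caller already registered this replica,
    // in which case we skip starting a second LIR thread.
    if (!lirInProgress.add(coreNodeName)) {
      return;
    }
    updateExecutor.submit(() -> {
      try {
        lirTask.run();
      } finally {
        // Allow a future LIR attempt once this one has finished.
        lirInProgress.remove(coreNodeName);
      }
    });
  }
}
{code}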
> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect
> retryOnConnLoss
> ------------------------------------------------------------------------------------
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.2, 5.2.1
> Reporter: Shalin Shekhar Mangar
> Labels: Jepsen
> Fix For: 5.3, Trunk
>
> Attachments: SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads
> do not hang during a partition on ZK operations. However, some of those
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed
> to put a leader into a 'down' state (I'm still investigating and will open a
> separate issue about this problem).
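For reference, the intent of the retryOnConnLoss flag is that ZK writes issued
from the indexing path fail fast on connection loss instead of blocking the
update during a partition. The sketch below only illustrates that contract
under assumed names (ZkWrite, runWrite, the retry budget and backoff); it is
not Solr's SolrZkClient implementation:
{code:java}
// Hypothetical sketch of the retryOnConnLoss contract, not Solr's SolrZkClient
// implementation: callers on the indexing path pass retryOnConnLoss=false so a
// ZooKeeper connection loss surfaces immediately instead of blocking the update.
import org.apache.zookeeper.KeeperException;

interface ZkWrite {
  void run() throws KeeperException, InterruptedException;
}

public class ZkOps {
  public void runWrite(ZkWrite op, boolean retryOnConnLoss)
      throws KeeperException, InterruptedException {
    int attemptsLeft = retryOnConnLoss ? 5 : 1; // illustrative retry budget
    while (true) {
      try {
        op.run();
        return;
      } catch (KeeperException.ConnectionLossException e) {
        if (--attemptsLeft <= 0) {
          // Indexing threads (retryOnConnLoss=false) fail fast here rather
          // than hanging on ZK during a partition.
          throw e;
        }
        Thread.sleep(1000); // illustrative backoff before retrying
      }
    }
  }
}
{code}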