[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738572#comment-14738572 ]

ASF subversion and git services commented on SOLR-7819:
--------------------------------------------------------

Commit 1702213 from sha...@apache.org in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1702213 ]

SOLR-7819: ZK connection loss or session timeout do not stall indexing threads anymore and LIR activity is moved to a background thread

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect
> retryOnConnLoss
>
>                 Key: SOLR-7819
>                 URL: https://issues.apache.org/jira/browse/SOLR-7819
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.2, 5.2.1
>            Reporter: Shalin Shekhar Mangar
>              Labels: Jepsen
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7819.patch, SOLR-7819.patch, SOLR-7819.patch, SOLR-7819.patch, SOLR-7819.patch, SOLR-7819.patch
>
> SOLR-7245 added a retryOnConnLoss parameter to ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads do not hang on ZK operations during a partition. However, some of those changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1, where a hung update managed to put a leader into a 'down' state (I'm still investigating and will open a separate issue about this problem).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737321#comment-14737321 ]

ASF subversion and git services commented on SOLR-7819:
--------------------------------------------------------

Commit 1702067 from sha...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1702067 ]

SOLR-7819: ZK connection loss or session timeout do not stall indexing threads anymore and LIR activity is moved to a background thread
[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650674#comment-14650674 ]

Ramkumar Aiyengar commented on SOLR-7819:
-----------------------------------------

A couple of comments; looks sensible overall.

{code}
log.info("Node " + replicaNodeName +
    " is not live, so skipping leader-initiated recovery for replica: core={} coreNodeName={}",
    replicaCoreName, replicaCoreNodeName);
// publishDownState will be false to avoid publishing the "down" state too many times
// as many errors can occur together and will each call into this method (SOLR-6189)
{code}

It still goes ahead with `publishDownState` if `forcePublishState` is true; is that intentional? The caller does check whether the replica is live, but there could be a race. Similarly, if our own state is suspect due to a ZK disconnect or session expiry (the block before this one), should the force flag be respected?

{code}
// if the replica's state is not DOWN right now, make it so ...
// we only really need to try to send the recovery command if the node itself is "live"
if (getZkStateReader().getClusterState().liveNodesContain(replicaNodeName)) {
  LeaderInitiatedRecoveryThread lirThread =
{code}

The comment doesn't make sense anymore, as the code it describes has moved to LeaderInitiatedRecoveryThread.
[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646066#comment-14646066 ]

Shalin Shekhar Mangar commented on SOLR-7819:
---------------------------------------------

Hmm, this last patch isn't quite right because it can create multiple LIR threads for the same replica on connection loss. For example, I found the following in the logs on one of the nodes. Here 4 LIR threads were created to ask the same replica to recover:

{code}
2015-07-29 13:21:24.629 INFO  (updateExecutor-2-thread-18-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 0 secs
2015-07-29 13:21:24.978 INFO  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.978 WARN  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
2015-07-29 13:21:24.978 INFO  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 39 secs
2015-07-29 13:21:24.979 INFO  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.979 WARN  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
2015-07-29 13:21:24.979 INFO  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 28 secs
2015-07-29 13:21:24.981 INFO  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.981 WARN  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
2015-07-29 13:21:24.981 INFO  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 33 secs
{code}
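The duplication visible in these logs suggests the leader needs a per-replica guard so that a burst of failed updates starts at most one LIR thread per replica. A minimal sketch of such a guard, using `ConcurrentHashMap.putIfAbsent` for the atomic check-and-set (class and method names here are illustrative, not Solr's actual API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: track which replicas already have an LIR thread running, so
// concurrent failing updates spawn at most one recovery thread per replica.
public class LirThreadGuard {
  // coreNodeName -> marker; putIfAbsent is atomic, so only one caller "wins"
  private final Map<String, Boolean> replicasInLir = new ConcurrentHashMap<>();

  /** Returns true if the caller should start an LIR thread for this replica. */
  public boolean tryAcquire(String coreNodeName) {
    return replicasInLir.putIfAbsent(coreNodeName, Boolean.TRUE) == null;
  }

  /** Called by the LIR thread when it finishes, success or failure. */
  public void release(String coreNodeName) {
    replicasInLir.remove(coreNodeName);
  }
}
```

With this in place, the second through fourth threads in the log above would see `tryAcquire` return false and simply not start.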
[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640506#comment-14640506 ]

Shalin Shekhar Mangar commented on SOLR-7819:
---------------------------------------------

bq. I think we already do this, look at DistributedUpdateProcessor.java around line 883, if we are unable to set the LIR node, we start a thread to keep retrying the node set.

Umm, it looks like the reverse to me. If we are unable to set the LIR node, or if there is an exception, then sendRecoveryCommand=false and we do not create the LeaderInitiatedRecoveryThread at all?
[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639404#comment-14639404 ]

Ramkumar Aiyengar commented on SOLR-7819:
-----------------------------------------

Duh, this is why we need a good test for this (I gave up after trying for a bit in the original ticket), and I need to pay more attention to automated merges. It looks like my initial patch had the change, but when I merged with [your changes|https://svn.apache.org/viewvc?view=revision&revision=1666825] for SOLR-7109, the use of the local variable got removed :(

I get your concern. I think we already do this: look at DistributedUpdateProcessor.java around line 883; if we are unable to set the LIR node, we start a thread to keep retrying the node set. We just need to return false in the connection-loss case as well; we currently do it only if the node is not live (and hence we didn't even bother setting the node).
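The flow being proposed can be sketched as a decision function: the synchronous helper returns false whenever the LIR node is not confirmed written, and the caller treats false as "fall back to the retrying background thread". The names and enum below are illustrative stand-ins, not Solr's actual code:

```java
// Sketch of the return-value convention under discussion: only a confirmed
// ZK write returns true; connection loss must also return false so the
// caller falls back to the retry thread instead of dropping recovery.
public class LirDecision {
  public enum ZkResult { OK, NODE_NOT_LIVE, CONNECTION_LOSS }

  /** True only when the LIR node was definitely written to ZooKeeper. */
  public static boolean ensureReplicaInLir(ZkResult zkResult) {
    switch (zkResult) {
      case OK:              return true;
      case NODE_NOT_LIVE:   return false; // existing behavior: node never set
      case CONNECTION_LOSS: return false; // the case missing from the patch
      default:              return false;
    }
  }
}
```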
[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss
[ https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639380#comment-14639380 ]

Shalin Shekhar Mangar commented on SOLR-7819:
---------------------------------------------

[~andyetitmoves] - It looks like the commits for SOLR-7245 only added a retryOnConnLoss parameter, but it was never used inside the ZkController.updateLeaderInitiatedRecoveryState method?

Also, now that I am thinking about this change, is it really safe? For example, if a leader was not able to write to a 'live' replica, and during the LIR process the leader couldn't complete a ZK operation (because retryOnConnLoss=false), then LIR won't be set and updates can be missed. Also, the code as it is currently written bails on a ConnectionLossException and doesn't even start a LIR thread, which is bad.

I think not having a thread wait on LIR-related activity is a noble cause, but we should move the entire LIR logic to a background thread which retries on connection loss until it either succeeds or a session expired exception is thrown. Thoughts?
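The background loop proposed above (retry on connection loss, stop on success or session expiry) could look roughly like the following. `ZkWrite` and the two exception classes are illustrative stand-ins for the real ZooKeeper types, to keep the sketch self-contained:

```java
// Sketch of the proposed background LIR loop: keep retrying the ZK write on
// connection loss, and give up only on success or session expiry.
public class LirRetryLoop {
  public static class ConnectionLossException extends Exception {}
  public static class SessionExpiredException extends Exception {}

  /** A single attempt to update the LIR state in ZooKeeper. */
  public interface ZkWrite {
    void attempt() throws ConnectionLossException, SessionExpiredException;
  }

  /**
   * Meant to run in a background thread so indexing threads never block on ZK.
   * Returns true once the write succeeds; false if the session expired, in
   * which case a new session must re-evaluate whether LIR is still needed.
   */
  public static boolean runUntilDone(ZkWrite write, long retrySleepMs)
      throws InterruptedException {
    while (true) {
      try {
        write.attempt();
        return true;                 // LIR state written, we're done
      } catch (ConnectionLossException e) {
        Thread.sleep(retrySleepMs);  // transient: back off and retry
      } catch (SessionExpiredException e) {
        return false;                // fatal for this session: stop retrying
      }
    }
  }
}
```

The key property is that an indexing thread only has to hand the work off; the loop itself, not the caller, absorbs the connection-loss retries.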