[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

2015-09-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738572#comment-14738572
 ] 

ASF subversion and git services commented on SOLR-7819:
---

Commit 1702213 from sha...@apache.org in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1702213 ]

SOLR-7819: ZK connection loss or session timeout do not stall indexing threads 
anymore and LIR activity is moved to a background thread

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> 
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.2, 5.2.1
>Reporter: Shalin Shekhar Mangar
>  Labels: Jepsen
> Fix For: Trunk, 5.4
>
> Attachments: SOLR-7819.patch, SOLR-7819.patch, SOLR-7819.patch, 
> SOLR-7819.patch, SOLR-7819.patch, SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang during a partition on ZK operations. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

2015-09-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737321#comment-14737321
 ] 

ASF subversion and git services commented on SOLR-7819:
---

Commit 1702067 from sha...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1702067 ]

SOLR-7819: ZK connection loss or session timeout do not stall indexing threads 
anymore and LIR activity is moved to a background thread







[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

2015-08-02 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650674#comment-14650674
 ] 

Ramkumar Aiyengar commented on SOLR-7819:
-

A couple of comments; looks sensible overall.

{code}
  log.info("Node " + replicaNodeName +
      " is not live, so skipping leader-initiated recovery for replica: core={} coreNodeName={}",
      replicaCoreName, replicaCoreNodeName);
  // publishDownState will be false to avoid publishing the "down" state too many times
  // as many errors can occur together and will each call into this method (SOLR-6189)
{code}

It still goes ahead with `publishDownState` if `forcePublishState` is true; is that intentional? The caller does check whether the replica is live, but there could be a race. Similarly, if our own state is suspect due to a ZK disconnect or session expiry (the block before this one), should the force flag be respected?

{code}
  // if the replica's state is not DOWN right now, make it so ...
  // we only really need to try to send the recovery command if the node itself is "live"
  if (getZkStateReader().getClusterState().liveNodesContain(replicaNodeName)) {

    LeaderInitiatedRecoveryThread lirThread =
{code}

The comment no longer makes sense now that this code has moved to LeaderInitiatedRecoveryThread.

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> 
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.2, 5.2.1
>Reporter: Shalin Shekhar Mangar
>  Labels: Jepsen
> Fix For: 5.3, Trunk
>
> Attachments: SOLR-7819.patch, SOLR-7819.patch
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang during a partition on ZK operations. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).






[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

2015-07-29 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646066#comment-14646066
 ] 

Shalin Shekhar Mangar commented on SOLR-7819:
-

Hmm, this last patch isn't quite right: it can create multiple LIR threads for the same replica on connection loss.

For example, I found the following in the logs on one of the nodes, where four LIR threads were created to ask the same replica to recover:
{code}
2015-07-29 13:21:24.629 INFO  (updateExecutor-2-thread-18-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 0 secs
2015-07-29 13:21:24.978 INFO  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.978 WARN  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
2015-07-29 13:21:24.978 INFO  (updateExecutor-2-thread-19-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 39 secs
2015-07-29 13:21:24.979 INFO  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.979 WARN  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
2015-07-29 13:21:24.979 INFO  (updateExecutor-2-thread-21-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 28 secs
2015-07-29 13:21:24.981 INFO  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.c.ZkStateReader Updating data for jepsen5x3 to ver 95
2015-07-29 13:21:24.981 WARN  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread Stop trying to send recovery command to downed replica core=jepsen5x3_shard2_replica1,coreNodeName=core_node2 on n1:8983_solr because core_node1 is no longer the leader! New leader is core_node2
2015-07-29 13:21:24.981 INFO  (updateExecutor-2-thread-22-processing-x:jepsen5x3_shard2_replica2 r:core_node1 http:n1:8983//solr//jepsen5x3_shard2_replica1// n:n5:8983_solr s:shard2 c:jepsen5x3) [c:jepsen5x3 s:shard2 r:core_node1 x:jepsen5x3_shard2_replica2] o.a.s.c.LeaderInitiatedRecoveryThread LeaderInitiatedRecoveryThread-jepsen5x3_shard2_replica1 completed successfully after running for 33 secs
{code}
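The duplicate-thread problem in the log above boils down to a missing per-replica guard. As a rough illustration (hypothetical class and method names, not Solr's actual fix), a concurrent set keyed on the replica's coreNodeName can ensure at most one LIR thread is started per replica at a time:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: allow at most one recovery thread per replica.
// Set.add on a concurrent set is atomic, so only one caller "wins".
public class RecoveryThreadGuard {
    private final Set<String> replicasInRecovery = ConcurrentHashMap.newKeySet();

    /** Returns true iff the caller won the right to start a recovery thread. */
    public boolean tryStartRecovery(String coreNodeName) {
        return replicasInRecovery.add(coreNodeName);
    }

    /** Called by the recovery thread when it finishes, success or failure. */
    public void recoveryFinished(String coreNodeName) {
        replicasInRecovery.remove(coreNodeName);
    }
}
```

With a guard like this, the second, third, and fourth callers in the log would see `tryStartRecovery` return false and skip spawning another thread.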

[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

2015-07-24 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640506#comment-14640506
 ] 

Shalin Shekhar Mangar commented on SOLR-7819:
-

bq. I think we already do this, look at DistributedUpdateProcessor.java around 
line 883, if we are unable to set the LIR node, we start a thread to keep 
retrying the node set.

Umm, it looks like the reverse to me: if we are unable to set the LIR node, or if there is an exception, then sendRecoveryCommand=false and we do not create the LeaderInitiatedRecoveryThread at all?

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> 
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.2, 5.2.1
>Reporter: Shalin Shekhar Mangar
>  Labels: Jepsen
> Fix For: 5.3, Trunk
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang during a partition on ZK operations. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).






[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

2015-07-23 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639404#comment-14639404
 ] 

Ramkumar Aiyengar commented on SOLR-7819:
-

Duh, this is why we need a good test for this (I gave up after trying for a bit in the original ticket), and why I need to pay more attention to automated merges. My initial patch had the change, but when I merged with [your changes|https://svn.apache.org/viewvc?view=revision&revision=1666825] for SOLR-7109, the use of the local variable just got removed :(

I get your concern, but I think we already do this: look at DistributedUpdateProcessor.java around line 883. If we are unable to set the LIR node, we start a thread to keep retrying the node set. We just need to return false in the connection-loss case as well; we currently do that only if the node is not live (and hence we didn't even bother setting the node).
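A minimal sketch of the caller-side pattern described above, with hypothetical names standing in for the real DistributedUpdateProcessor/ZkController code (and an unchecked exception standing in for ZooKeeper's checked ConnectionLossException): the marker method reports false both when the node is not live and, per the suggestion, on connection loss, so the caller falls back to its retry thread in both cases:

```java
// Hypothetical simplification, not Solr's actual method signatures.
public class LirMarker {
    /**
     * Try to write the LIR marker synchronously. Returns false whenever the
     * marker was NOT written, so the caller can start a background retry
     * thread instead of giving up.
     */
    public boolean ensureReplicaInRecovery(boolean nodeIsLive, Runnable zkWrite) {
        if (!nodeIsLive) {
            return false; // existing behavior: nothing written, caller retries later
        }
        try {
            zkWrite.run(); // stand-in for the actual ZooKeeper update
            return true;
        } catch (RuntimeException connectionLoss) {
            // proposed change: treat connection loss the same as "not set",
            // so the caller still spawns the background retry thread
            return false;
        }
    }
}
```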

> ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect 
> retryOnConnLoss
> 
>
> Key: SOLR-7819
> URL: https://issues.apache.org/jira/browse/SOLR-7819
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.2, 5.2.1
>Reporter: Shalin Shekhar Mangar
>  Labels: Jepsen
> Fix For: 5.3, Trunk
>
>
> SOLR-7245 added a retryOnConnLoss parameter to 
> ZkController.ensureReplicaInLeaderInitiatedRecovery so that indexing threads 
> do not hang during a partition on ZK operations. However, some of those 
> changes were unintentionally reverted by SOLR-7336 in 5.2.
> I found this while running Jepsen tests on 5.2.1 where a hung update managed 
> to put a leader into a 'down' state (I'm still investigating and will open a 
> separate issue about this problem).






[jira] [Commented] (SOLR-7819) ZkController.ensureReplicaInLeaderInitiatedRecovery does not respect retryOnConnLoss

2015-07-23 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639380#comment-14639380
 ] 

Shalin Shekhar Mangar commented on SOLR-7819:
-

[~andyetitmoves] - It looks like the commits for SOLR-7245 only added a 
retryOnConnLoss parameter but it was never used inside the 
ZkController.updateLeaderInitiatedRecoveryState method?

Also, now that I am thinking about this change, is it really safe? For example, if a leader was not able to write to a 'live' replica, and during the LIR process the leader couldn't complete a ZK operation (because retryOnConnLoss=false), then LIR won't be set and updates can be missed. Also, the code as currently written bails on a ConnectionLossException and doesn't even start a LIR thread, which is bad.

I think not having a thread wait on LIR-related activity is a noble cause, but we should move the entire LIR logic to a background thread which retries on connection loss until it either succeeds or a session-expired exception is thrown.

Thoughts?
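A rough sketch of that proposal (illustrative names only, not Solr's actual API, with simple stand-in exception types for ZooKeeper's ConnectionLoss and SessionExpired): all ZK work for LIR runs in a loop that retries on connection loss and terminates only on success or session expiry:

```java
// Hypothetical sketch of "retry LIR's ZK work on a background thread until
// success or session expiry". ConnLossException / SessionExpiredException are
// stand-ins for the corresponding ZooKeeper KeeperException subclasses.
public class LirRetryLoop {
    public interface ZkOp {
        void run() throws ConnLossException, SessionExpiredException;
    }
    public static class ConnLossException extends Exception {}
    public static class SessionExpiredException extends Exception {}

    /** Returns true on success, false if the session expired. */
    public static boolean runWithRetry(ZkOp op, long backoffMs) throws InterruptedException {
        while (true) {
            try {
                op.run();
                return true;                 // succeeded; stop retrying
            } catch (ConnLossException e) {
                Thread.sleep(backoffMs);     // transient: back off and retry
            } catch (SessionExpiredException e) {
                return false;                // fatal for this session: give up
            }
        }
    }
}
```

Run on a background executor, a loop like this keeps indexing threads out of the ZK retry path entirely, which is the stall the issue title is about.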



