[jira] [Commented] (SOLR-6511) Fencepost error in LeaderInitiatedRecoveryThread

Alan Woodward (JIRA) Fri, 12 Sep 2014 10:25:55 -0700

    [ https://issues.apache.org/jira/browse/SOLR-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131777#comment-14131777 ]


Alan Woodward commented on SOLR-6511:
-------------------------------------

So here's how this manifested:
* replica1 is busy sending updates to replica2 when it gets a network blip and 
its ZK connection times out
* replica2 is then elected leader
* replica1 also still thinks it's leader (because it hasn't noticed the ZK 
timeout yet) and then gets errors back from replica2 saying "I'm the leader, 
stop sending me these updates!"
* replica1 interprets these as errors, and attempts to put replica2 into 
leader-initiated recovery
* what ought to happen here is that replica2 sends a message back saying "no 
need, I'm the leader, I'll take it from here, thanks".  But because of the 
fencepost error, the message to replica2 is never actually sent, and replica1 
then writes replica2's state as DOWN into the LIRT ZK node
* the two replicas send each other some request-recover messages, trying to 
work out who is actually leader
* replica2 then tries to recover, but it can't publish itself as active, 
because you can't do that if your LIRT state is DOWN, so it eventually goes 
into RECOVERY_FAILED

There is a bunch of fairly confusing logging around all this as well.  I 
particularly liked the messages that said "WaitingForState recovering, but I 
see state: recovering" :-)

> Fencepost error in LeaderInitiatedRecoveryThread
> ------------------------------------------------
>
>                 Key: SOLR-6511
>                 URL: https://issues.apache.org/jira/browse/SOLR-6511
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Alan Woodward
>
> At line 106:
> {code}
>     while (continueTrying && ++tries < maxTries) {
> {code}
> should be
> {code}
>     while (continueTrying && ++tries <= maxTries) {
> {code}
> This is only a problem when called from DistributedUpdateProcessor, as it can 
> have maxTries set to 1, which means the loop is never actually run.
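
To make the off-by-one concrete, here is a minimal standalone sketch of the two loop conditions quoted above. It is not the actual LeaderInitiatedRecoveryThread code: the class and method names (FencepostDemo, attemptsWithLessThan, attemptsWithLessOrEqual) are hypothetical, and the loop body, which in Solr would send the recovery request, is replaced by a simple counter.

```java
public class FencepostDemo {

    // Mirrors the original condition: ++tries < maxTries.
    // Hypothetical helper; the counter stands in for the real loop body.
    static int attemptsWithLessThan(int maxTries) {
        boolean continueTrying = true;
        int tries = 0;
        int attempts = 0;
        while (continueTrying && ++tries < maxTries) {
            attempts++; // the real loop would send the request here
        }
        return attempts;
    }

    // Mirrors the proposed fix: ++tries <= maxTries.
    static int attemptsWithLessOrEqual(int maxTries) {
        boolean continueTrying = true;
        int tries = 0;
        int attempts = 0;
        while (continueTrying && ++tries <= maxTries) {
            attempts++;
        }
        return attempts;
    }

    public static void main(String[] args) {
        for (int maxTries : new int[] {1, 3}) {
            System.out.println("maxTries=" + maxTries
                + "  '<' attempts=" + attemptsWithLessThan(maxTries)
                + "  '<=' attempts=" + attemptsWithLessOrEqual(maxTries));
        }
    }
}
```

With maxTries set to 1, the `<` form makes zero attempts (the pre-increment yields 1, and 1 < 1 is false), while the `<=` form makes exactly one. For any larger maxTries the `<` form is always one attempt short.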



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
