[jira] [Created] (ZOOKEEPER-2167) Restarting current leader node sometimes results in a permanent loss of quorum

Mike Lundy (JIRA) Tue, 14 Apr 2015 17:03:26 -0700

Mike Lundy created ZOOKEEPER-2167:
-------------------------------------

             Summary: Restarting current leader node sometimes results in a 
permanent loss of quorum
                 Key: ZOOKEEPER-2167
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2167
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.4.6
            Reporter: Mike Lundy
         Attachments: fails-to-rejoin-quorum.gz


I'm seeing an issue where a restart of the current leader node results in a 
long-term / permanent loss of quorum (I've only waited 30 minutes, but it 
doesn't look like it's making any progress). Restarting the same instance 
_again_ seems to resolve the problem.

To me, this looks a lot like the issue described in 
https://issues.apache.org/jira/browse/ZOOKEEPER-1026, but I'm filing this 
separately for the moment in case I am wrong.

Notes on the attached log:
1) If you search for XXX in the log, you'll find my annotations marking when 
the process was told to terminate, when it was reported to have terminated, 
and the same two points for the start.
2) To save you the trouble of figuring it out, here's the zid <=> ip mapping:
zid=1, ip=10.20.0.19
zid=2, ip=10.20.0.18
zid=3, ip=10.20.0.20
zid=4, ip=10.20.0.21
zid=5, ip=10.20.0.22
3) It's important to note that this log was captured during a rolling 
service restart to remove an instance; in this case, zid #2 / 10.20.0.18 is the 
one being removed, so if you see a conspicuous silence from that service, 
that's why. 
4) I've been unable to reproduce this problem _except_ during cluster size 
changes, so I suspect that may be related. It's also important to note that 
this test is going from 5 -> 4 (which means, since we remove one and then do a 
rolling restart, we are actually temporarily dropping to 3). I know this is not 
a recommended thing (this is more of a stress test). We have seen this same 
problem on larger cluster sizes; it just seems easier to reproduce on 
smaller sizes.
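
For reference, the quorum arithmetic behind note 4 can be sketched as follows. 
This is only a minimal sketch: it assumes the default majority quorum (not 
hierarchical or weighted quorums), and the function names are hypothetical 
helpers for illustration, not part of ZooKeeper.

```python
def majority_quorum(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be positive")
    return ensemble_size // 2 + 1


def spare_servers(ensemble_size: int, servers_up: int) -> int:
    """Servers that can still fail before quorum is lost (negative = lost)."""
    return servers_up - majority_quorum(ensemble_size)


# (configured ensemble size, servers currently up) for the 5 -> 4 test:
# the full ensemble, then one removed plus one restarting.
for configured, up in [(5, 5), (5, 3), (4, 4), (4, 3)]:
    print(
        f"configured={configured} up={up} "
        f"quorum={majority_quorum(configured)} "
        f"spare={spare_servers(configured, up)}"
    )
```

Under that assumption, both a 5-member and a 4-member configuration need 3 
votes, so with only 3 servers up there is no margin: all three must take part 
in the election for a leader to be chosen.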



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
