Mike Lundy created ZOOKEEPER-2167:
-------------------------------------
Summary: Restarting current leader node sometimes results in a
permanent loss of quorum
Key: ZOOKEEPER-2167
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2167
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.4.6
Reporter: Mike Lundy
Attachments: fails-to-rejoin-quorum.gz
I'm seeing an issue where a restart of the current leader node results in a
long-term / permanent loss of quorum (I've only waited 30 minutes, but it
doesn't look like it's making any progress). Restarting the same instance
_again_ seems to resolve the problem.
To me, this looks a lot like the issue described in
https://issues.apache.org/jira/browse/ZOOKEEPER-1026, but I'm filing this
separately for the moment in case I am wrong.
Notes on the attached log:
1) If you search for XXX in the log, you'll see my annotations marking when the
process was told to terminate, when termination was reported complete, and
then the same two events for the start
2) To save you the trouble of figuring it out, here's the zid <=> ip mapping:
zid=1, ip=10.20.0.19
zid=2, ip=10.20.0.18
zid=3, ip=10.20.0.20
zid=4, ip=10.20.0.21
zid=5, ip=10.20.0.22
3) It's important to note that this log was captured during a rolling
service restart to remove an instance; in this case, zid #2 / 10.20.0.18 is the
one being removed, so if you see a conspicuous silence from that service,
that's why.
4) I've been unable to reproduce this problem _except_ during cluster size
changes, so I suspect that may be related; it's also important to note that
this test is going from 5 -> 4 (which means, since we remove one and then do a
rolling restart, we are actually temporarily dropping to 3). I know this is not a
recommended thing (this is more of a stress test). We have seen this same
problem on larger cluster sizes; it just seems easier to reproduce on
smaller sizes.
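For reference, the quorum arithmetic behind note 4 can be sketched roughly as
below. This assumes the default majority quorum (no weights or hierarchical
groups); majority() and spare_voters() are hypothetical helpers written for
illustration, not ZooKeeper APIs.

```python
# zid -> ip mapping copied from note 2 of this report.
ENSEMBLE = {
    1: "10.20.0.19",
    2: "10.20.0.18",  # the instance being removed
    3: "10.20.0.20",
    4: "10.20.0.21",
    5: "10.20.0.22",
}


def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority (assumed quorum rule)."""
    return voters // 2 + 1


def spare_voters(configured: int, live: int) -> int:
    """Live voters beyond the required majority; negative means quorum is lost."""
    return live - majority(configured)


if __name__ == "__main__":
    # zid 2 is gone and one more server is mid-restart, leaving 3 live voters.
    live = len(ENSEMBLE) - 2
    # During the rolling restart some servers may still hold the 5-member
    # config while others already hold the 4-member one, so check both.
    for configured in (5, 4):
        print(f"configured={configured} needs={majority(configured)} "
              f"live={live} spare={spare_voters(configured, live)}")
```

Under those assumptions, both the 5-member and the 4-member configuration need
3 voters, so with 3 live servers there is no spare: quorum only holds while all
three stay up and connected.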