[ https://issues.apache.org/jira/browse/ZOOKEEPER-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Lundy updated ZOOKEEPER-2167:
----------------------------------
    Attachment: fails-to-rejoin-quorum.gz

> Restarting current leader node sometimes results in a permanent loss of quorum
> ------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2167
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2167
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.6
>            Reporter: Mike Lundy
>         Attachments: fails-to-rejoin-quorum.gz
>
>
> I'm seeing an issue where a restart of the current leader node results in a 
> long-term / permanent loss of quorum (I've only waited 30 minutes, but it 
> doesn't look like it's making any progress). Restarting the same instance 
> _again_ seems to resolve the problem.
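>
> (For anyone trying to reproduce: a quick way to watch each node's view of
> the quorum during the restart is the "srvr" four-letter word. A minimal
> sketch in Python, assuming the default client port 2181 and the IPs listed
> in the notes below:)
> {code}
> import socket
>
> # IPs from the zid <=> ip mapping in the notes; port 2181 is an assumption.
> SERVERS = ["10.20.0.19", "10.20.0.18", "10.20.0.20", "10.20.0.21", "10.20.0.22"]
>
> def srvr(host, port=2181, timeout=2.0):
>     """Send the 'srvr' four-letter word and return the raw reply."""
>     with socket.create_connection((host, port), timeout=timeout) as s:
>         s.sendall(b"srvr")
>         chunks = []
>         while True:
>             data = s.recv(4096)
>             if not data:
>                 break
>             chunks.append(data)
>     return b"".join(chunks).decode("utf-8", errors="replace")
>
> for host in SERVERS:
>     try:
>         reply = srvr(host)
>         # A healthy member reports "Mode: leader" or "Mode: follower";
>         # a node outside quorum replies that it is not serving requests.
>         mode = next((l for l in reply.splitlines() if l.startswith("Mode:")),
>                     reply.splitlines()[0] if reply else "no reply")
>         print(host, "->", mode)
>     except OSError as e:
>         print(host, "-> unreachable:", e)
> {code}
>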
> To me, this looks a lot like the issue described in 
> https://issues.apache.org/jira/browse/ZOOKEEPER-1026, but I'm filing this 
> separately for the moment in case I am wrong.
> Notes on the attached log:
> 1) If you search for XXX in the log, you'll find my annotations marking when
> the process was told to terminate, when it reported having completed that,
> and the same pair of events for the start.
> 2) To save you the trouble of figuring it out, here's the zid <=> ip mapping
> (a sketch of the implied zoo.cfg ensemble section follows these notes):
> zid=1, ip=10.20.0.19
> zid=2, ip=10.20.0.18
> zid=3, ip=10.20.0.20
> zid=4, ip=10.20.0.21
> zid=5, ip=10.20.0.22
> 3) It's important to note that this log was captured during a rolling
> service restart to remove an instance; in this case, zid #2 / 10.20.0.18 is
> the one being removed, so if you see a conspicuous silence from that
> service, that's why.
> 4) I've been unable to reproduce this problem _except_ during cluster size
> changes, so I suspect that may be related. It's also important to note that
> this test goes from 5 -> 4 (which means, since we remove one member and then
> do a rolling restart, we temporarily drop to 3). I know this is not
> recommended (it's more of a stress test). We have seen the same problem on
> larger clusters; it just seems easier to reproduce on smaller ones.
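>
> For clarity, here's a sketch of the zoo.cfg ensemble section these zids
> imply (the 2888:3888 peer/election ports are assumptions; 3.4.x uses static
> configuration, so the 5 -> 4 change means editing this file on every member
> and rolling-restarting):
> {code}
> # Hypothetical ensemble section; peer/election ports are assumed defaults.
> # server.2 (10.20.0.18) is the member being removed in this test.
> server.1=10.20.0.19:2888:3888
> server.2=10.20.0.18:2888:3888
> server.3=10.20.0.20:2888:3888
> server.4=10.20.0.21:2888:3888
> server.5=10.20.0.22:2888:3888
> {code}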



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
