[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067521#comment-17067521 ]
Mate Szalay-Beko commented on ZOOKEEPER-3769:
---------------------------------------------

I was trying to reproduce the issue using ZooKeeper 3.5.7 and OpenJDK 12.0.2 with:
- this compose file: https://github.com/symat/zookeeper-docker-test/blob/master/3_nodes_zk_jdk_12.yml
- this config (based on your config): https://github.com/symat/zookeeper-docker-test/blob/master/conf/ZOOKEEPER-3769_zoo.cfg

I used the OpenJDK 12.0.2 runtime in the Docker containers, and I tried ZooKeeper 3.5.7 compiled with both 8u242 and 12.0.2.

Unfortunately, everything worked fine... I didn't see the BufferUnderflowException, and the quorum came back up quickly after I stopped the container of Server 3 (which was the leader previously).

Maybe it is an OS- or networking-related thing that cannot be simulated with Docker on a single machine.

Anyway, I will create a patched version to handle this exception.

> fast leader election does not end if leader is taken down
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-3769
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.5.7
>            Reporter: Lasaro Camargos
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: node1.log, node2.log, node3.log
>
>
> In a cluster with three nodes, node3 is the leader and the other nodes are
> followers.
> If I stop node3, the other two nodes do not finish the leader election.
> This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and
> this config
>
> tickTime=2000
> initLimit=30
> syncLimit=3
> dataDir=/hedvig/hpod/data
> dataLogDir=/hedvig/hpod/log
> clientPort=2181
> snapCount=100000
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> skipACL=yes
> preAllocSize=65536
> maxClientCnxns=0
> 4lw.commands.whitelist=*
> admin.enableServer=false
> server.1=companydemo1.snc4.companyinc.com:3000:4000
> server.2=companydemo2.snc4.companyinc.com:3000:4000
> server.3=companydemo3.snc4.companyinc.com:3000:4000
>
> Could you have a look at the logs and help me figure this out? It seems like
> node 1 is not getting notifications back from node2, but I don't see anything
> wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could
> be causing it.
>
> In the logs, node3 is killed at 11:17:14
> node2 is killed at 11:17:50 and node 1 at 11:18:02
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
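
For readers unfamiliar with the exception mentioned in the comment: `java.nio.BufferUnderflowException` is thrown when a relative `ByteBuffer` read asks for more bytes than remain in the buffer. The sketch below is only an illustration of guarding against that when parsing a length-prefixed message. It is not ZooKeeper code and not the patch referred to above; the class `LengthPrefixedReader`, the method `tryRead`, the `[int length][payload]` frame layout, and the `maxLen` limit are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of ZooKeeper.
final class LengthPrefixedReader {
    private LengthPrefixedReader() {}

    /**
     * Tries to read one [int length][payload] frame from buf.
     * Returns Optional.empty() and leaves the buffer position unchanged
     * if the frame is truncated or the declared length is invalid,
     * instead of letting a BufferUnderflowException propagate.
     */
    static Optional<byte[]> tryRead(ByteBuffer buf, int maxLen) {
        // Not even enough bytes for the length prefix.
        if (buf.remaining() < Integer.BYTES) {
            return Optional.empty();
        }
        // Absolute read: peeks at the length without moving the position.
        int len = buf.getInt(buf.position());
        // Reject negative, oversized, or truncated payloads.
        if (len < 0 || len > maxLen || buf.remaining() - Integer.BYTES < len) {
            return Optional.empty();
        }
        // Safe to consume the prefix and the payload now.
        buf.position(buf.position() + Integer.BYTES);
        byte[] payload = new byte[len];
        buf.get(payload);
        return Optional.of(payload);
    }

    public static void main(String[] args) {
        ByteBuffer complete = ByteBuffer.allocate(16);
        complete.putInt(3).put(new byte[] {1, 2, 3}).flip();
        System.out.println("complete frame parsed: " + tryRead(complete, 1024).isPresent());

        // Declares 10 payload bytes but carries only 2.
        ByteBuffer truncated = ByteBuffer.allocate(16);
        truncated.putInt(10).put(new byte[] {1, 2}).flip();
        System.out.println("truncated frame parsed: " + tryRead(truncated, 1024).isPresent());
    }
}
```

Checking `remaining()` before reading lets the caller decide how to react to a malformed message (for example, log it and drop the connection) rather than having the exception end the receiving thread. Whether that is the right behaviour for the election code is a question for the actual patch.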