[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068558#comment-17068558 ]
Mate Szalay-Beko commented on ZOOKEEPER-3769:
---------------------------------------------

Thanks, this is an important observation. It also means that the first patched ZooKeeper I shared with you will not help in all cases. I will create a more complete patch and send it to you later today. I will also try to write some unit tests that generate malformed / partial notification messages.

But I am still confused why no one has seen this problem before. Is there anything special in your networking setup? Are you maybe using some tunnels?

Also: are you starting the cluster with an empty snapshot, or do you always boot the cluster from an existing snapshot when you run these tests? Have you used dynamic reconfig in the cluster before?

> fast leader election does not end if leader is taken down
> ---------------------------------------------------------
>
> Key: ZOOKEEPER-3769
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.5.7
> Reporter: Lasaro Camargos
> Assignee: Mate Szalay-Beko
> Priority: Major
> Attachments: node1.log, node2.log, node3.log
>
>
> In a cluster with three nodes, node3 is the leader and the other nodes are
> followers.
> If I stop node3, the other two nodes do not finish the leader election.
> This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and
> this config
>
> tickTime=2000
> initLimit=30
> syncLimit=3
> dataDir=/company/service/data
> dataLogDir=/company/service/log
> clientPort=2181
> snapCount=100000
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> skipACL=yes
> preAllocSize=65536
> maxClientCnxns=0
> 4lw.commands.whitelist=*
> admin.enableServer=false
> server.1=companydemo1.snc4.companyinc.com:3000:4000
> server.2=companydemo2.snc4.companyinc.com:3000:4000
> server.3=companydemo3.snc4.companyinc.com:3000:4000
>
> Could you have a look at the logs and help me figure this out? It seems like
> node 1 is not getting notifications back from node2, but I don't see anything
> wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could
> be causing it.
>
> In the logs, node3 is killed at 11:17:14
> node2 is killed at 11:17:50 2 and node 1 at 11:18:02
>
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
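
For readers following the thread, a note on why the two surviving nodes are expected to finish the election: with three voting servers and the default majority quorum, two votes are more than half of the ensemble. The following is only a minimal sketch of that arithmetic, assuming no weights or groups are configured. The class and method names are hypothetical and are not ZooKeeper's own API.

```java
import java.util.Set;

public class MajorityQuorumSketch {
    // Hypothetical helper: true when the given server ids form a strict
    // majority of an ensemble with ensembleSize voting members.
    public static boolean containsQuorum(Set<Long> serverIds, int ensembleSize) {
        return serverIds.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 3; // server.1, server.2, server.3 in the reported config

        // node3 stopped, node1 and node2 still running
        System.out.println(containsQuorum(Set.of(1L, 2L), ensembleSize)); // prints "true"

        // only node1 running
        System.out.println(containsQuorum(Set.of(1L), ensembleSize)); // prints "false"
    }
}
```

So losing node3 alone should not block the election on quorum grounds, which is consistent with the report pointing at notifications not arriving rather than at the ensemble size.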