[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067546#comment-17067546 ]
Mate Szalay-Beko commented on ZOOKEEPER-3769:
---------------------------------------------

I created a patched 3.5.7 version, where the exception is caught and the malformed message is skipped. Can you maybe try out this version?
https://drive.google.com/open?id=1cTdusaEFIVvH2D5KSrj6M9VVJoqlaQwD

This should print out a warning to the log after catching the exception:
{{Skipping the processing of a partial / malformed response message sent by sid=XXX}}

> fast leader election does not end if leader is taken down
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-3769
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.5.7
>            Reporter: Lasaro Camargos
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: node1.log, node2.log, node3.log
>
>
> In a cluster with three nodes, node3 is the leader and the other nodes are
> followers.
> If I stop node3, the other two nodes do not finish the leader election.
> This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and
> this config
>
> tickTime=2000
> initLimit=30
> syncLimit=3
> dataDir=/hedvig/hpod/data
> dataLogDir=/hedvig/hpod/log
> clientPort=2181
> snapCount=100000
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> skipACL=yes
> preAllocSize=65536
> maxClientCnxns=0
> 4lw.commands.whitelist=*
> admin.enableServer=false
> server.1=companydemo1.snc4.companyinc.com:3000:4000
> server.2=companydemo2.snc4.companyinc.com:3000:4000
> server.3=companydemo3.snc4.companyinc.com:3000:4000
>
> Could you have a look at the logs and help me figure this out? It seems like
> node 1 is not getting notifications back from node2, but I don't see anything
> wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could
> be causing it.
>
> In the logs, node3 is killed at 11:17:14
> node2 is killed at 11:17:50 2 and node 1 at 11:18:02
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
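
For readers following the thread, the catch-and-skip behaviour described in the comment above can be sketched roughly as below. This is not the actual patch and not ZooKeeper code: the class `SkipMalformedSketch`, the `Vote` type, the two-field payload layout, and the choice of `BufferUnderflowException` are all assumptions made for illustration (the comment does not name the exception). Only the warning text is taken from the comment.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Illustrative sketch only; names and message layout are hypothetical.
public class SkipMalformedSketch {
    private static final Logger LOG =
            Logger.getLogger(SkipMalformedSketch.class.getName());

    // Hypothetical, simplified election message: sender id plus two longs.
    static final class Vote {
        final long sid;
        final long leader;
        final long zxid;

        Vote(long sid, long leader, long zxid) {
            this.sid = sid;
            this.leader = leader;
            this.zxid = zxid;
        }
    }

    // Reads two longs from the payload. ByteBuffer.getLong() throws
    // BufferUnderflowException when fewer than 8 bytes remain, which is
    // how a truncated message surfaces in this sketch.
    static Vote parse(long sid, ByteBuffer payload) {
        long leader = payload.getLong();
        long zxid = payload.getLong();
        return new Vote(sid, leader, zxid);
    }

    // Processes every message; a malformed one is logged and skipped
    // instead of letting the exception terminate the receiving loop.
    static List<Vote> processAll(long[] sids, ByteBuffer[] payloads) {
        List<Vote> accepted = new ArrayList<>();
        for (int i = 0; i < payloads.length; i++) {
            try {
                accepted.add(parse(sids[i], payloads[i]));
            } catch (BufferUnderflowException e) {
                LOG.warning("Skipping the processing of a partial / malformed "
                        + "response message sent by sid=" + sids[i]);
            }
        }
        return accepted;
    }
}
```

The point of the design is that one bad message costs only that message: the loop keeps running, so later well-formed notifications from other peers are still processed.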