[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066710#comment-17066710 ]

Mate Szalay-Beko commented on ZOOKEEPER-3769:
---------------------------------------------

Thanks for the logs and all the details! I'm still trying to understand what exactly 
happened here.

Unfortunately the logs for Server 3 end around 11:16:15. Maybe not all the log 
messages got flushed to the log file.

The first thing I found is that after Server 3 is killed, both Server 1 and Server 2 
start a new leader election (around 11:17:14) and send their votes to each other, 
but Server 1 never receives the votes of Server 2, while Server 2 does receive the 
votes of Server 1.

The second interesting thing is that after 11:17:14 only Server 2 periodically tries 
to re-establish the connection to Server 3. Server 1 doesn't seem to realize that 
the connection is broken.

Something is really odd with Server 1...

There is an important error for Server 1, which may explain the problem:
{code:java}
03/24/20 11:16:16,297 [WorkerReceiver[myid=1]] ERROR [org.apache.zookeeper.server.NIOServerCnxnFactory] (NIOServerCnxnFactory.java:92) - Thread Thread[WorkerReceiver[myid=1],5,main] died
java.nio.BufferUnderflowException: null
     at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:178) ~[?:?]
     at java.nio.ByteBuffer.get(ByteBuffer.java:762) ~[?:?]
     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:278) ~[zookeeper-3.5.7.jar:3.5.7]
{code}

As far as I can tell, this exception (and the Server 1 and Server 3 log messages 
before this time) suggests that somehow a partial message got transferred from 
Server 3 to Server 1, causing unexpected problems. Also, for some reason the 
connection between Server 3 and Server 1 got broken well before 11:17:14.
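
Just to illustrate the failure mode (this is a minimal, hypothetical sketch, not the 
actual FastLeaderElection parsing code): the stack trace shows the notification being 
read with relative {{get}} calls on a {{ByteBuffer}}, so if fewer bytes arrived than 
the parser expects (e.g. a partial message), the read throws 
{{BufferUnderflowException}}, and if nothing catches it, the receiving thread dies:
{code:java}
import java.nio.ByteBuffer;

// Hypothetical demo of reading past the end of a truncated message buffer.
public class PartialMessageDemo {
    public static void main(String[] args) {
        // Pretend only 12 bytes of a larger notification arrived.
        ByteBuffer partial = ByteBuffer.allocate(12);
        partial.putInt(1);    // e.g. a state field
        partial.putLong(42L); // e.g. a proposed leader id
        partial.flip();

        partial.getInt();     // ok
        partial.getLong();    // ok, the buffer is now exhausted
        partial.getLong();    // throws java.nio.BufferUnderflowException
    }
}
{code}
If something like this happened in the WorkerReceiver, a single truncated packet 
would kill the vote-processing thread on Server 1, which would match Server 1 never 
processing the votes of Server 2 afterwards.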

This is a strange scenario I haven't seen before. I will continue the investigation.

At the moment my guess is that this is some OS / JDK / network related issue that 
triggers an unhandled exception in ZooKeeper. Would it be possible for you to try 
two things?
 - Does the same problem appear with an older Java version (e.g. the latest 
OpenJDK 8)? I'm wondering why we don't handle this particular exception in the 
code; maybe this is something we haven't seen before with the older JDK versions.
 - Does the same problem appear when using 
{{zookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory}}
 in zoo.cfg? Netty is also supported in the newer ZooKeeper versions (in fact, 
switching to Netty is necessary if someone wants to use SSL in ZooKeeper). Maybe 
using Netty instead of NIO would avoid the problem; see the example configuration 
right after this list.
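
For reference, a minimal example of that change (note: the admin guide documents 
this setting as the {{serverCnxnFactory}} key in zoo.cfg, with 
{{zookeeper.serverCnxnFactory}} as the equivalent Java system property, so if one 
form doesn't take effect, try the other):
{code}
# zoo.cfg -- switch the server connection factory from NIO (the default) to Netty
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
{code}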

Even if one of these workarounds helps, we still have an issue in ZooKeeper (at 
least around handling this exception correctly). But it would help the investigation 
to see whether either of these changes fixes the problem.

> fast leader election does not end if leader is taken down
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-3769
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.5.7
>            Reporter: Lasaro Camargos
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: node1.log, node2.log, node3.log
>
>
> In a cluster with three nodes, node3 is the leader and the other nodes are 
> followers.
> If I stop node3, the other two nodes do not finish the leader election.
> This is happening with ZK 3.5.7,  openjdk version "12.0.2" 2019-07-16, and 
> this config
>  
> tickTime=2000
> initLimit=30
> syncLimit=3
> dataDir=/hedvig/hpod/data
> dataLogDir=/hedvig/hpod/log
> clientPort=2181
> snapCount=100000
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> skipACL=yes
> preAllocSize=65536
> maxClientCnxns=0
> 4lw.commands.whitelist=*
> admin.enableServer=false
> server.1=companydemo1.snc4.companyinc.com:3000:4000
> server.2=companydemo2.snc4.companyinc.com:3000:4000
> server.3=companydemo3.snc4.companyinc.com:3000:4000
>  
> Could you have a look at the logs and help me figure this out? It seems like 
> node 1 is not getting notifications back from node2, but I don't see anything 
> wrong with the network so I am wondering if bugs like  ZOOKEEPER-3756 could 
> be causing it.
>  
> In the logs, node3 is killed at 11:17:14
> node2 is killed at 11:17:50 and node 1 at 11:18:02
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
