[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068558#comment-17068558
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3769:
---------------------------------------------

Thanks, this is an important observation. It also means that the first patched 
ZooKeeper I shared with you will not help in all cases.

I will create a more complete patch and send it to you later today. I will also 
try to write some unit tests to generate malformed / partial notification 
messages.
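
A minimal sketch of what such a test helper could look like, assuming a hypothetical length-prefixed layout (leader id, zxid, payload length, payload). This is not ZooKeeper's actual notification wire format, and the class and method names are made up for illustration. The idea is only to cut a valid message at every possible offset and check that the parser rejects each fragment instead of blocking or misreading it:

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialMessageSketch {

    // Hypothetical layout (an assumption, NOT the real wire format):
    // 8-byte leader id, 8-byte zxid, 4-byte payload length, payload bytes.
    public static byte[] encode(long leader, long zxid, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(8 + 8 + 4 + payload.length);
        buf.putLong(leader);
        buf.putLong(zxid);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Returns true only if the bytes form exactly one complete message.
    public static boolean isWellFormed(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        try {
            buf.getLong();            // leader id
            buf.getLong();            // zxid
            int len = buf.getInt();   // declared payload length
            // Reject negative lengths and any mismatch with the bytes left.
            return len >= 0 && len == buf.remaining();
        } catch (BufferUnderflowException e) {
            // The header itself was cut short.
            return false;
        }
    }

    // Simulates a partial message by keeping only the first n bytes.
    public static byte[] truncate(byte[] bytes, int n) {
        return Arrays.copyOf(bytes, n);
    }

    public static void main(String[] args) {
        byte[] full = encode(3L, 1L, "cfg".getBytes(StandardCharsets.UTF_8));
        System.out.println("full message well-formed: " + isWellFormed(full));
        for (int n = 0; n < full.length; n++) {
            System.out.println("first " + n + " bytes well-formed: "
                    + isWellFormed(truncate(full, n)));
        }
    }
}
```

A real test would feed such fragments into the election code's receive path rather than into a standalone parser, which is where the hang would actually show up.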

But I am still confused about why no one has seen this problem before. Is there 
anything special in your networking setup? Are you maybe using some tunnels?

Also: are you starting the cluster with an empty snapshot, or do you always boot 
the cluster from an existing snapshot when you run these tests? Have you used 
dynamic reconfig in the cluster before?

> fast leader election does not end if leader is taken down
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-3769
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.5.7
>            Reporter: Lasaro Camargos
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: node1.log, node2.log, node3.log
>
>
> In a cluster with three nodes, node3 is the leader and the other nodes are 
> followers.
> If I stop node3, the other two nodes do not finish the leader election.
> This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and 
> this config:
>  
> tickTime=2000
> initLimit=30
> syncLimit=3
> dataDir=/company/service/data
> dataLogDir=/company/service/log
> clientPort=2181
> snapCount=100000
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> skipACL=yes
> preAllocSize=65536
> maxClientCnxns=0
> 4lw.commands.whitelist=*
> admin.enableServer=false
> server.1=companydemo1.snc4.companyinc.com:3000:4000
> server.2=companydemo2.snc4.companyinc.com:3000:4000
> server.3=companydemo3.snc4.companyinc.com:3000:4000
>  
> Could you have a look at the logs and help me figure this out? It seems like 
> node1 is not getting notifications back from node2, but I don't see anything 
> wrong with the network, so I am wondering if bugs like ZOOKEEPER-3756 could 
> be causing it.
>  
> In the logs, node3 is killed at 11:17:14,
> node2 at 11:17:50, and node1 at 11:18:02.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
