[
https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144813#comment-17144813
]
Manuel commented on ZOOKEEPER-3822:
-----------------------------------
This issue is affecting my deployment of version 3.6.1 as well. It appears
that once a client disconnects, the ZooKeeper cluster continuously prints the
warning to the log. My clients run in AWS ECS (Elastic Container Service) on
Fargate, so each time they are redeployed they get a new local IP address. The
desired behavior is to print the warning once, clear the connection, and move
on for good. Currently, ZooKeeper keeps printing the same warnings forever
*for each incident*, and they never clear up, so over time more and more of
them pile up and the log volume keeps compounding. It is a severe problem and
I am going to have to revert to version 3.5.x. There is definitely a bug; it
should not behave this way. The server is probably keeping this ever-growing
list of dead connections in memory and will eventually crash. It also fills
our CloudWatch logs with garbage, which drives up the bill.
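
Since every abruptly dropped client socket becomes one of these incidents on the
server side, one client-side mitigation is to close the ZooKeeper handle
explicitly when a Fargate task is stopped, so the session ends cleanly instead
of the task simply vanishing. A minimal sketch, assuming the standard
org.apache.zookeeper.ZooKeeper client; the connect string and session timeout
below are placeholders, and this only reduces the number of abrupt disconnects
the servers see, it does not fix the repeated-logging bug itself:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

public class GracefulClientShutdown {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and session timeout; adjust for your ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000,
                (WatchedEvent event) -> { /* connection state changes arrive here */ });

        // Close the handle when the container is asked to stop (ECS sends SIGTERM),
        // so the session is ended explicitly instead of the server only noticing
        // a dead socket later.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                zk.close();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        // ... application work ...
    }
}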
> Zookeeper 3.6.1 EndOfStreamException
> ------------------------------------
>
> Key: ZOOKEEPER-3822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.6.1
> Reporter: Sebastian Schmitz
> Priority: Critical
> Attachments: kafka.log, kafka_test.log, zookeeper.log,
> zookeeper_test.log
>
>
> Hello,
> after Zookeeper 3.6.1 fixed the issue with the leader-election messages
> containing the IP, which caused the election to fail across separate networks
> like our Docker setup, I updated from 3.4.14 to 3.6.1 in the Dev and Test
> environments. It all went smoothly and ran for one day. Last night the
> environment was updated again: we deploy everything as one package of
> containers (Kafka, Zookeeper, Mirrormaker etc.), so the Zookeeper containers
> are also replaced with the latest ones. In this case there was no actual
> change; the containers were just removed and deployed again. Since Zookeeper's
> config and data are not stored inside the containers, that is normally not a
> problem, but this time it broke the whole Zookeeper cluster, and so Kafka was
> down as well.
> * zookeeper_node_1 was stopped and the container removed and created again
> * zookeeper_node_1 starts up and the election takes place
> * zookeeper_node_2 is elected as leader again
> * zookeeper_node_2 is stopped and the container removed and created again
> * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
> * zookeeper_node_2 starts up and zookeeper_node_3 remains leader
> And from there all servers just report
> 2020-05-07 14:07:57,187 [myid:3] - WARN [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
> EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /z.z.z.z:46060, session = 0x2014386bbde0000
>         at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>         at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.base/java.lang.Thread.run(Unknown Source)
> and don't recover.
> I was able to recover the cluster in Test-Environment by stopping and
> starting all the zookeeper-nodes. The cluster in dev is still in that state
> and I'm checking the logs to find out more...
> The full logs of the deployment of Zookeeper and Kafka that started at 02:00
> are attached. The first timestamp is in local NZ time and the second one is
> UTC. The IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2 and
> z.z.z.z for node_3.
> The Kafka servers are running on the same machine, which means the
> EndOfStreamExceptions could also be connections from Kafka, as I don't think
> zookeeper_node_3 establishes a session with itself?
>
> Edit:
> I just found some interesting log from Test-Environment:
> zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client /f.f.f.f:42012 as it has seen zxid 0xc600000000 our last zxid is 0xc5000028f8 client must try another server
> zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client /f.f.f.f:51506 as it has seen zxid 0xc600000000 our last zxid is 0xc5000028f8 client must try another server
> These entries are repeated there before the EndOfStreamException shows up...
> I found that this zxid was set by zookeeper_node_3:
> zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):Leader@1501] Have quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last processed zxid: 0xc600000000
> zookeeper_node_3: 2020-05-07 14:10:12,587 [myid:3] INFO [LearnerHandler-/z.z.z.z:60156:LearnerHandler@800] Synchronizing with Learner sid: 2 maxCommittedLog=0xc5000028f8 minCommittedLog=0xc500002704 lastProcessedZxid=0xc600000000 peerLastZxid=0xc5000028f8
> It looks like this zxid update didn't reach nodes 1 and 2, and so they refuse
> connections from clients?
> I have also added the full logs of our test environment now.
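
For context on the "Refusing session request" lines quoted above: a ZooKeeper
zxid packs the leader epoch into the high 32 bits and a per-epoch counter into
the low 32 bits, so 0xc600000000 is the first zxid of epoch 0xc6 (set by the
new leader, node 3), while 0xc5000028f8 still belongs to epoch 0xc5. A follower
whose own last zxid is behind what a client has already seen must turn the
session away rather than serve a stale view. A minimal sketch of that
comparison (the class and method names here are illustrative, not ZooKeeper's
own):

public class ZxidCompare {
    // ZooKeeper zxid layout: high 32 bits = leader epoch, low 32 bits = counter.
    static long epoch(long zxid)   { return zxid >>> 32; }
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long seenByClient = 0xc600000000L; // zxid the client reports (epoch 0xc6, counter 0)
        long lastOnServer = 0xc5000028f8L; // last zxid on nodes 1 and 2 (epoch 0xc5, counter 0x28f8)

        System.out.printf("client: epoch=0x%x counter=0x%x%n",
                epoch(seenByClient), counter(seenByClient));
        System.out.printf("server: epoch=0x%x counter=0x%x%n",
                epoch(lastOnServer), counter(lastOnServer));

        // The server's state is older than what the client has already observed,
        // so accepting the session could make the client's view go backwards.
        System.out.println("refuse session: " + (seenByClient > lastOnServer));
    }
}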
--
This message was sent by Atlassian Jira
(v8.3.4#803005)