Sebastian Schmitz created ZOOKEEPER-3822:
--------------------------------------------
Summary: Zookeeper 3.6.1 EndOfStreamException
Key: ZOOKEEPER-3822
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.6.1
Reporter: Sebastian Schmitz
Attachments: zookeeper.log

Hello,

ZooKeeper 3.6.1 fixed the issue where leader election contained the IP address and therefore failed across separate networks, as in our Docker setup. After that fix I updated from 3.4.14 to 3.6.1 in our Dev and Test environments. Everything went smoothly and ran for one day.

Last night the environment was updated again. Because we deploy all containers (Kafka, ZooKeeper, MirrorMaker, etc.) as one package, the ZooKeeper containers are also replaced with the latest ones. In this case nothing had changed; the containers were simply removed and deployed again. Since ZooKeeper's config and data are not stored inside the containers, that is normally not a problem, but this time it broke the ZooKeeper clusters entirely, and so Kafka was down as well.

* zookeeper_node_1 was stopped and the container removed and created again
* zookeeper_node_1 starts up and the election takes place
* zookeeper_node_2 is elected as leader again
* zookeeper_node_2 is stopped and the container removed and created again
* zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
* zookeeper_node_2 starts up and zookeeper_node_3 remains leader

From then on, all servers just report

2020-05-07 14:07:57,187 [myid:3] - WARN [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /z.z.z.z:46060, session = 0x2014386bbde0000
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

and don't recover.

I was able to recover the cluster in the Test environment by stopping and starting all the ZooKeeper nodes. The cluster in Dev is still in that state, and I'm checking the logs to find out more.

The full log of the deployment, which started at 02:00, is attached. The first timestamp is in local NZ time and the second one is UTC. I replaced the IPs with x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for node_3.

The Kafka servers are running on the same machine, which means that the EndOfStreamExceptions could also come from Kafka connections, as I don't think that zookeeper_node_3 establishes a session with itself?

--
This message was sent by Atlassian Jira (v8.3.4#803005)