[ https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109616#comment-17109616 ]

Sebastian Schmitz commented on ZOOKEEPER-3822:
----------------------------------------------

Hello Mate,

Yes, it is reproducible. I can deploy 3.4.14 again and everything is fine. When 
I upgrade from 3.4.14 to 3.6.1, the first run completes without any issues. 
However, as soon as I re-deploy 3.6.1, this issue comes up.

There is a wait between the deployments of the different nodes because the 
script waits for Kafka to complete its startup, which usually takes 2-5 
minutes. That should be enough time for Zookeeper to get everything sorted?
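
In case it helps: between the node deployments the script could also check the 
quorum state directly instead of relying only on the Kafka startup time. This 
is only a minimal sketch, assuming the "srvr" four-letter word is whitelisted 
(I believe that is the default) and using a hypothetical hostname:
{code:java}
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch: poll a restarted ZooKeeper node with the "srvr" four-letter word
// until it reports "Mode: leader" or "Mode: follower", i.e. it has rejoined
// the quorum, before the next node is taken down.
public class WaitForQuorum {

    static String fourLetterWord(String host, int port, String cmd) throws Exception {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 3000);
            OutputStream out = s.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            return new String(s.getInputStream().readAllBytes(), StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "zookeeper_node_2"; // hypothetical hostname
        for (int i = 0; i < 60; i++) { // up to ~5 minutes at 5-second intervals
            try {
                String srvr = fourLetterWord(host, 2181, "srvr");
                if (srvr.contains("Mode: leader") || srvr.contains("Mode: follower")) {
                    System.out.println(host + " rejoined the quorum:\n" + srvr);
                    return;
                }
            } catch (Exception e) {
                // node not reachable yet, keep polling
            }
            Thread.sleep(5000);
        }
        throw new IllegalStateException(host + " did not rejoin the quorum in time");
    }
}
{code}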

By "the cluster doesn't recover" I mean that Kafka and Zookeeper usually 
detect a problem like this automatically and solve it on their own. It might 
take 5 minutes longer, but then the cluster recovers and starts working again. 
In this case that didn't happen. The deployment of the Kafka-Zookeeper-Cluster 
in the Dev-Environment is scheduled to run at 02:00 AM; when I came to the 
office at 07:00 it was still in that state. The problem shows up as soon as 
the second node is being deployed, while the re-deployment of the first node 
completes without issues.

I'm not sure what causes this, but it happened every time I re-deployed 
3.6.1.

Right now my workaround to get the cluster back into a healthy state is to 
stop all zookeepers and start them again in order 1-2-3.

I just did another re-deployment in Dev after the upgrade from 3.4.14 to 
3.6.1 was done on Saturday night. Node 1 completed without problems. When 
node 2 was upgraded the problem started, and even Kafka node 1, which had been 
connected to Zookeeper before, got disconnected. The log in Zookeeper just 
showed these entries:
{code:java}
May 18 07:27:01 zookeeper_node_1: 2020-05-17 19:19:35,653 [myid:1] - INFO  [NIOWorkerThread-4:Learner@137] - Revalidating client: 0x201759453520000
May 18 07:27:01 zookeeper_node_1: 2020-05-17 19:19:35,653 [myid:1] - INFO  [NIOWorkerThread-4:Learner@137] - Revalidating client: 0x201759453520000
May 18 07:27:01 zookeeper_node_1: 2020-05-17 19:19:47,660 [myid:1] - WARN  [NIOWorkerThread-8:NIOServerCnxn@364] - Unexpected exception
May 18 07:27:01 zookeeper_node_1: EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /x.x.x.x:50842, session = 0x201759453520000
May 18 07:27:01 zookeeper_node_1:  at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
May 18 07:27:01 zookeeper_node_1:  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
May 18 07:27:01 zookeeper_node_1:  at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
May 18 07:27:01 zookeeper_node_1:  at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
May 18 07:27:01 zookeeper_node_1:  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
May 18 07:27:01 zookeeper_node_1:  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
May 18 07:27:01 zookeeper_node_1:  at java.base/java.lang.Thread.run(Unknown Source)
{code}
My Dev-Environment is ready to help debug this. Just tell me what to do or 
check, or even which branch of Zookeeper to build and deploy.

Best regards

Sebastian

> Zookeeper 3.6.1 EndOfStreamException
> ------------------------------------
>
>                 Key: ZOOKEEPER-3822
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.1
>            Reporter: Sebastian Schmitz
>            Priority: Critical
>         Attachments: kafka.log, kafka_test.log, zookeeper.log, 
> zookeeper_test.log
>
>
> Hello,
> after Zookeeper 3.6.1 solved the issue with the leader election containing the 
> IP, which caused it to fail across separate networks like in our Docker setup, 
> I updated from 3.4.14 to 3.6.1 in the Dev- and Test-Environments. It all went 
> smoothly and ran for one day. This night I had a new update of the 
> environment: as we deploy everything as one package of containers (Kafka, 
> Zookeeper, Mirrormaker etc.), we also replace the Zookeeper containers with 
> the latest ones. In this case there was no change; the containers were just 
> removed and deployed again. As the config and data of Zookeeper are not stored 
> inside the containers, that is normally not a problem, but this night it broke 
> the whole Zookeeper clusters, and so Kafka was down as well.
>  * zookeeper_node_1 was stopped and the container removed and created again
>  * zookeeper_node_1 starts up and the election takes place
>  * zookeeper_node_2 is elected as leader again
>  * zookeeper_node_2 is stopped and the container removed and created again
>  * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
>  * zookeeper_node_2 starts up and zookeeper_node_3 remains leader
> And from there all servers just report
> 2020-05-07 14:07:57,187 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
> EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /z.z.z.z:46060, session = 0x2014386bbde0000
>  at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
>  at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>  at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>  at java.base/java.lang.Thread.run(Unknown Source)
> and don't recover.
> I was able to recover the cluster in Test-Environment by stopping and 
> starting all the zookeeper-nodes. The cluster in dev is still in that state 
> and I'm checking the logs to find out more...
> The full logs of the deployment of Zookeeper and Kafka that started at 02:00 
> are attached. The first timestamp is local NZ time and the second one is UTC. 
> The IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for 
> node_3.
> The Kafka servers are running on the same machines, which means that the 
> EndOfStreamExceptions could also be connections from Kafka, as I don't think 
> that zookeeper_node_3 would establish a session with itself?
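> One way to check this might be the AdminServer that should be enabled by 
> default in 3.6 (port 8080 unless it was changed); a rough sketch, with a 
> hypothetical hostname, that dumps the connection details so the session ids 
> from the log can be matched to client addresses:
> {code:java}
> import java.io.InputStream;
> import java.net.URL;
> import java.nio.charset.StandardCharsets;
> 
> // Sketch: query one ZooKeeper node's AdminServer "connections" command and
> // print the raw JSON, which lists the connected clients and their sessions.
> public class ListZkConnections {
>     public static void main(String[] args) throws Exception {
>         String host = args.length > 0 ? args[0] : "zookeeper_node_3"; // hypothetical hostname
>         URL url = new URL("http://" + host + ":8080/commands/connections");
>         try (InputStream in = url.openStream()) {
>             System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
>         }
>     }
> }
> {code}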
>  
> Edit:
>  I just found some interesting log entries from the Test-Environment:
>  zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO  [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client /f.f.f.f:42012 as it has seen zxid 0xc600000000 our last zxid is 0xc5000028f8 client must try another server
>  zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO  [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client /f.f.f.f:51506 as it has seen zxid 0xc600000000 our last zxid is 0xc5000028f8 client must try another server
>  These entries are repeated there before the EndOfStreamException shows up...
>  I found that this zxid was set by zookeeper_node_3:
>  zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO  [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):Leader@1501] Have quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last processed zxid: 0xc600000000
>  zookeeper_node_3: 2020-05-07 14:10:12,587 [myid:3] INFO  [LearnerHandler-/z.z.z.z:60156:LearnerHandler@800] Synchronizing with Learner sid: 2 maxCommittedLog=0xc5000028f8 minCommittedLog=0xc500002704 lastProcessedZxid=0xc600000000 peerLastZxid=0xc5000028f8
>  It looks like this update of the zxid didn't reach nodes 1 and 2, so they 
> refuse connections from clients that have already seen it?
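> For reference, as far as I understand it the zxid packs the leader epoch into 
> the high 32 bits and a per-epoch counter into the low 32 bits, so 
> 0xc600000000 is the very first zxid of the new epoch 0xc6 while 0xc5000028f8 
> is still in epoch 0xc5. A tiny sketch that decodes the two values from the 
> log above:
> {code:java}
> // Sketch: decode the zxids from the log entries above. A client that has seen
> // 0xc600000000 (epoch 0xc6) is "ahead" of a server whose last processed zxid
> // is 0xc5000028f8 (epoch 0xc5), which is what the servers complain about.
> public class DecodeZxid {
>     static long epoch(long zxid)   { return zxid >>> 32; }
>     static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }
> 
>     public static void main(String[] args) {
>         long leaderZxid   = 0xc600000000L; // set by zookeeper_node_3 after the election
>         long followerZxid = 0xc5000028f8L; // last zxid known to nodes 1 and 2
>         System.out.printf("leader:   epoch=0x%x counter=0x%x%n", epoch(leaderZxid), counter(leaderZxid));
>         System.out.printf("follower: epoch=0x%x counter=0x%x%n", epoch(followerZxid), counter(followerZxid));
>     }
> }
> {code}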
> I have also added the full logs for our Test-Environment now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
