[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138805#comment-17138805
 ] 

Brittany Barnes commented on ZOOKEEPER-3036:
--------------------------------------------

Also encountered this issue in our production zookeeper cluster running 3.4.14 
(3 master 3 observer). 

Leader node (zoo3) gave the following errors:
{code:java}
2020-06-15 23:55:36,507 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo1>:50298:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:55:36,508 [myid:3] - WARN  [LearnerHandler-/<fqdn 
zoo1>:50298:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo1>:50298 ********

2020-06-15 23:55:52,704 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo2>:58112:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:55:52,711 [myid:3] - WARN  [LearnerHandler-/<fqdn 
zoo2>:58112:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo2>:58112 ********

2020-06-15 23:56:59,969 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo5>:56216:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.net.SocketTimeoutException: Read timed out 
    at java.net.SocketInputStream.socketRead0(Native Method) 
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
    at java.net.SocketInputStream.read(SocketInputStream.java:171) 
    at java.net.SocketInputStream.read(SocketInputStream.java:141) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:57:00,056 [myid:3] - WARN [LearnerHandler-/<fqdn 
zoo5>:56216:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo5>:56216 ********

2020-06-15 23:57:00,077 [myid:3] - WARN  [Sender-/<fqdn 
zoo1>:39530:LearnerHandler@235] - Unexpected exception at LearnerHandler 
Socket[addr=/<fqdn zoo1>,port=39530,localport=2888] 
tickOfNextAckDeadline:2409269 synced?:false queuedPacketLength:77
java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:215)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:59)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:517)

2020-06-15 23:57:00,076 [myid:3] - WARN [Sender-/<fqdn 
zoo2>:56076:LearnerHandler@235] - Unexpected exception at LearnerHandler 
Socket[addr=/<fqdn zoo2>,port=56076,localport=2888] 
tickOfNextAckDeadline:2409269 synced?:false queuedPacketLength:94
java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:215)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:59)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:517)

2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo1>:39530:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,070 [myid:3] - WARN  [LearnerHandler-/<fqdn 
zoo1>:39530:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo1>:39530 ********

2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo2>:56076:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,070 [myid:3] - WARN  [LearnerHandler-/<fqdn 
zoo2>:56076:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo2>:56076 ********

2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo4>:45138:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,071 [myid:3] - WARN  [LearnerHandler-/<fqdn 
zoo4>:45138:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo4>:45138 ********

2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo6>:35922:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,071 [myid:3] - WARN [LearnerHandler-/<fqdn 
zoo6>:35922:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo6>:35922 ********

2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn 
zoo3>:35922:LearnerHandler@648] - Unexpected exception causing shutdown while 
sock still open
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
    at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,069 [myid:3] - WARN [LearnerHandler-/<fqdn 
zoo3>:35922:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo3>:35922 
********{code}
 

Other servers showed logs like such (eg. zoo2):
{code:java}
2020-06-15 23:55:53,679 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception 
causing close of session 0x0: ZooKeeperServer not running
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
    at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:212)
    at java.lang.Thread.run(Thread.java:748)

2020-06-15 23:55:53,677 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@243] - Ignoring 
unexpected runtime exception
2020-06-15 23:55:53,677 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@366] - 
CancelledKeyException causing close of session 0x1000001e6f7bd24
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:986)

2020-06-15 23:55:53,439 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception 
causing close of session 0x1000001e6f7b701: Socket closed
2020-06-15 23:55:53,438 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception 
causing close of session 0x1000001e6f7b7a8: Broken pipe (Write failed)

java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)

2020-06-15 23:55:53,429 [myid:2] - WARN  
[QuorumPeer[myid=2]/0.0.0.0:2181:Follower@90] - Exception when following the 
leader
2020-06-15 23:56:42,727 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception 
causing close of session 0x0: ZooKeeperServer not running
2020-06-15 23:56:42,587 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception 
causing close of session 0x0: ZooKeeperServer not running
2020-06-15 23:56:42,583 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception 
causing close of session 0x0: ZooKeeperServer not running
2020-06-15 23:56:42,582 [myid:2] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception 
causing close of session 0x0: ZooKeeperServer not running
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:986)
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:56:42,562 [myid:2] - WARN  
[QuorumPeer[myid=2]/0.0.0.0:2181:Follower@90] - Exception when following the 
leader
{code}

This caused lots of outstanding requests to increase and continued until the 
following type of logs were seen: 
{code:java}
2020-06-16 00:23:44,120 [myid:2] - WARN  
[RecvWorker:3:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 3, 
my id = 2, error =
2020-06-16 00:23:44,095 [myid:1] - WARN  
[RecvWorker:3:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 3, 
my id = 1, error =
2020-06-16 00:16:52,044 [myid:2] - WARN  
[RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 1, 
my id = 2, error =
2020-06-16 00:16:52,004 [myid:3] - WARN  
[RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 1, 
my id = 3, error =
{code}
Which I know ZK will not recover from due to other bug 
https://issues.apache.org/jira/browse/ZOOKEEPER-2938 

The cluster was able to recover after long outage and many restarts. 

*Would it help to increase the maxSessionTimeout (perhaps doubling it) or would 
there be downsides?* I know it defaults to twice the tick time (our tick time 
is set to 2000).



 

> Unexpected exception in zookeeper
> ---------------------------------
>
>                 Key: ZOOKEEPER-3036
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3036
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.4.10
>         Environment: 3 Zookeepers, 5 kafka servers
>            Reporter: Oded
>            Priority: Critical
>
> We got an issue with one of the zookeeprs (Leader), causing the entire kafka 
> cluster to fail:
> 2018-05-09 02:29:01,730 [myid:3] - ERROR 
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@648] - Unexpected 
> exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
> 2018-05-09 02:29:01,730 [myid:3] - WARN  
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@661] - ******* GOODBYE 
> /192.168.0.91:42490 ********
>  
> We would expect that zookeeper will choose another Leader and the Kafka 
> cluster will continue to work as expected, but that was not the case.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to