[
https://issues.apache.org/jira/browse/ZOOKEEPER-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138805#comment-17138805
]
Brittany Barnes commented on ZOOKEEPER-3036:
--------------------------------------------
Also encountered this issue in our production zookeeper cluster running 3.4.14
(3 master 3 observer).
Leader node (zoo3) gave the following errors:
{code:java}
2020-06-15 23:55:36,507 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo1>:50298:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:55:36,508 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo1>:50298:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo1>:50298 ********
2020-06-15 23:55:52,704 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo2>:58112:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:55:52,711 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo2>:58112:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo2>:58112 ********
2020-06-15 23:56:59,969 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo5>:56216:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:57:00,056 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo5>:56216:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo5>:56216 ********
2020-06-15 23:57:00,077 [myid:3] - WARN [Sender-/<fqdn
zoo1>:39530:LearnerHandler@235] - Unexpected exception at LearnerHandler
Socket[addr=/<fqdn zoo1>,port=39530,localport=2888]
tickOfNextAckDeadline:2409269 synced?:false queuedPacketLength:77
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at
org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:215)
at
org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:59)
at
org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:517)
2020-06-15 23:57:00,076 [myid:3] - WARN [Sender-/<fqdn
zoo2>:56076:LearnerHandler@235] - Unexpected exception at LearnerHandler
Socket[addr=/<fqdn zoo2>,port=56076,localport=2888]
tickOfNextAckDeadline:2409269 synced?:false queuedPacketLength:94
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at
org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:215)
at
org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:59)
at
org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:517)
2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo1>:39530:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,070 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo1>:39530:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo1>:39530 ********
2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo2>:56076:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,070 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo2>:56076:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo2>:56076 ********
2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo4>:45138:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,071 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo4>:45138:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo4>:45138 ********
2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo6>:35922:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,071 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo6>:35922:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo6>:35922 ********
2020-06-15 23:57:00,049 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo3>:35922:LearnerHandler@648] - Unexpected exception causing shutdown while
sock still open
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-06-15 23:57:00,069 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo3>:35922:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo3>:35922
********{code}
Other servers showed logs like such (eg. zoo2):
{code:java}
2020-06-15 23:55:53,679 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception
causing close of session 0x0: ZooKeeperServer not running
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:212)
at java.lang.Thread.run(Thread.java:748)
2020-06-15 23:55:53,677 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@243] - Ignoring
unexpected runtime exception
2020-06-15 23:55:53,677 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@366] -
CancelledKeyException causing close of session 0x1000001e6f7bd24
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:986)
2020-06-15 23:55:53,439 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception
causing close of session 0x1000001e6f7b701: Socket closed
2020-06-15 23:55:53,438 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception
causing close of session 0x1000001e6f7b7a8: Broken pipe (Write failed)
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)
2020-06-15 23:55:53,429 [myid:2] - WARN
[QuorumPeer[myid=2]/0.0.0.0:2181:Follower@90] - Exception when following the
leader
2020-06-15 23:56:42,727 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception
causing close of session 0x0: ZooKeeperServer not running
2020-06-15 23:56:42,587 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception
causing close of session 0x0: ZooKeeperServer not running
2020-06-15 23:56:42,583 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception
causing close of session 0x0: ZooKeeperServer not running
2020-06-15 23:56:42,582 [myid:2] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception
causing close of session 0x0: ZooKeeperServer not running
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:986)
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-06-15 23:56:42,562 [myid:2] - WARN
[QuorumPeer[myid=2]/0.0.0.0:2181:Follower@90] - Exception when following the
leader
{code}
This caused lots of outstanding requests to increase and continued until the
following type of logs were seen:
{code:java}
2020-06-16 00:23:44,120 [myid:2] - WARN
[RecvWorker:3:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 3,
my id = 2, error =
2020-06-16 00:23:44,095 [myid:1] - WARN
[RecvWorker:3:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 3,
my id = 1, error =
2020-06-16 00:16:52,044 [myid:2] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 1,
my id = 2, error =
2020-06-16 00:16:52,004 [myid:3] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id 1,
my id = 3, error =
{code}
Which I know ZK will not recover from due to other bug
https://issues.apache.org/jira/browse/ZOOKEEPER-2938
The cluster was able to recover after long outage and many restarts.
*Would it help to increase the maxSessionTimeout (perhaps doubling it) or would
there be downsides?* I know it defaults to twice the tick time (our tick time
is set to 2000).
> Unexpected exception in zookeeper
> ---------------------------------
>
> Key: ZOOKEEPER-3036
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3036
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.4.10
> Environment: 3 Zookeepers, 5 kafka servers
> Reporter: Oded
> Priority: Critical
>
> We got an issue with one of the zookeeprs (Leader), causing the entire kafka
> cluster to fail:
> 2018-05-09 02:29:01,730 [myid:3] - ERROR
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@648] - Unexpected
> exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
> 2018-05-09 02:29:01,730 [myid:3] - WARN
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@661] - ******* GOODBYE
> /192.168.0.91:42490 ********
>
> We would expect that zookeeper will choose another Leader and the Kafka
> cluster will continue to work as expected, but that was not the case.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)