[
https://issues.apache.org/jira/browse/ZOOKEEPER-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16817358#comment-16817358
]
Michael K. Edwards commented on ZOOKEEPER-3036:
-----------------------------------------------
We encountered a similar error on the Leader of our production ZooKeeper
cluster (running stock 3.4.13, 7 voting members, 2 observers).
{noformat}
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:16,065
[myid:6] - ERROR [LearnerHandler-/10.3.50.66:39854:LearnerHandler@648] -
Unexpected exception causing shutdown while sock still open
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]:
java.net.SocketTimeoutException: Read timed out
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.socketRead0(Native Method)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:171)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:141)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.io.BufferedInputStream.read(BufferedInputStream.java:265)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.io.DataInputStream.readInt(DataInputStream.java:387)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
Apr 12 16:18:16 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:16,065
[myid:6] - WARN [LearnerHandler-/10.3.50.66:39854:LearnerHandler@661] -
******* GOODBYE /10.3.50.66:39854 ********
Apr 12 16:18:17 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:17,079
[myid:6] - INFO [SessionTracker:ZooKeeperServer@355] - Expiring session
0x90000180f040009, timeout of 6000ms exceeded
Apr 12 16:18:17 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:17,080
[myid:6] - INFO [ProcessThread(sid:6 cport:-1)::PrepRequestProcessor@487] -
Processed session termination for sessionid: 0x90000180f040009
Apr 12 16:18:17 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:17,084
[myid:6] - INFO [ProcessThread(sid:6 cport:-1)::PrepRequestProcessor@653] -
Got user-level KeeperException when processing sessionid:0x90000180f040007
type:create cxid:0x1da9 zxid:0x9000126d8 txntype:-1 reqpath:n/a Error
Path:/kafka/controller Error:KeeperErrorCode = NodeExists for /kafka/controller
Apr 12 16:18:19 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:18:19,897
[myid:6] - INFO [ProcessThread(sid:6 cport:-1)::PrepRequestProcessor@653] -
Got user-level KeeperException when processing sessionid:0x800001d10880004
type:delete cxid:0x566b zxid:0x900012dbd txntype:-1 reqpath:n/a Error
Path:/kafka/admin/preferred_replica_election Error:KeeperErrorCode = NoNode for
/kafka/admin/preferred_replica_election
{noformat}
It appears as though this was the first anomalous behavior logged by the
cluster, and it triggered this failure at the observer on the other end of the
connection:
{noformat}
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,066
[myid:9] - WARN [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:Observer@77] -
Exception when observing the leader
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: java.io.EOFException
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
java.io.DataInputStream.readInt(DataInputStream.java:392)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:73)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:968)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,066
[myid:9] - INFO [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:Observer@137] -
shutdown called
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: java.lang.Exception:
shutdown Observer
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:972)
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,067
[myid:9] - INFO [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1056] -
Closed socket connection for client /10.3.50.199:55168 which had sessionid
0x90000180f040009
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,067
[myid:9] - INFO [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1056] -
Closed socket connection for client /10.3.50.111:44036 which had sessionid
0x90000180f040005
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,068
[myid:9] - INFO [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1056] -
Closed socket connection for client /10.3.50.87:47988 which had sessionid
0x90000180f040007
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,068
[myid:9] - INFO [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@502]
- shutting down
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,068
[myid:9] - INFO
[QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:ObserverRequestProcessor@121] -
Shutting down
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,068
[myid:9] - INFO [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184]
- Shutting down
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,068
[myid:9] - INFO
[QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@403] - shutdown
of request processor complete
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,069
[myid:9] - INFO [CommitProcessor:9:CommitProcessor@153] - CommitProcessor
exited loop!
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,071
[myid:9] - INFO [ObserverRequestProcessor:9:ObserverRequestProcessor@105] -
ObserverRequestProcessor exited loop!
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,110
[myid:9] - INFO
[QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - Shutting
down
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,110
[myid:9] - INFO [SyncThread:9:SyncRequestProcessor@186] - SyncRequestProcessor
exited!
{noformat}
However, this seems to have left the observer in a corrupted internal state. It
started a new election cycle but received a response from only one of the seven
voting members (and that one was not the Leader):
{noformat}
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,111
[myid:9] - INFO [QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:QuorumPeer@909] -
LOOKING
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,112
[myid:9] - INFO
[QuorumPeer[myid=9]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New
election. My id = 9, proposed zxid=0x8000000000000000
Apr 12 16:18:16 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:18:16,121
[myid:9] - INFO [WorkerReceiver[myid=9]:FastLeaderElection@595] -
Notification: 1 (message format version), 6 (n.leader), 0x80002cdb7 (n.zxid),
0x9 (n.round), FOLLOWING (n.state), 7 (n.sid), 0x9 (n.peerEpoch) LOOKING (my
state)
{noformat}
It failed to send a notification to (and receive an ack from) the other six
voting members for 15 minutes' worth of retry cycles. Then it dumped a batch of
these (the same stack trace for each id from 1 through 6, all within the same
second):
{noformat}
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:33:47,001
[myid:9] - WARN [RecvWorker:6:QuorumCnxManager$RecvWorker@1025] - Connection
broken for id 6, my id = 9, error =
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: java.net.SocketException:
Connection timed out (Read failed)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.net.SocketInputStream.socketRead0(Native Method)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:171)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:141)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.io.BufferedInputStream.read(BufferedInputStream.java:265)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.io.DataInputStream.readInt(DataInputStream.java:387)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:33:47,006
[myid:9] - WARN [RecvWorker:6:QuorumCnxManager$RecvWorker@1028] - Interrupting
SendWorker
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:33:47,007
[myid:9] - WARN [SendWorker:6:QuorumCnxManager$SendWorker@941] - Interrupted
while waiting for message on queue
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]:
java.lang.InterruptedException
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)
Apr 12 16:33:47 prod-zk-observer-2 zookeeper[1114]: 2019-04-12 16:33:47,007
[myid:9] - WARN [SendWorker:6:QuorumCnxManager$SendWorker@951] - Send worker
leaving thread
{noformat}
And the really weird thing is that both ends of the connection appear to have
perceived a timeout, rather than one end timing out and sending a TCP RST and
the other seeing an EOF. On the Leader side, after the connection finally timed
out, we got a new connection request the next time the Observer's 60-second
timer rolled around, followed by a normal observer-rejoin sequence:
{noformat}
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:47,139
[myid:6] - WARN [RecvWorker:9:QuorumCnxManager$RecvWorker@1025] - Connection
broken for id 9, my id = 6, error =
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: java.net.SocketException:
Connection timed out (Read failed)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.socketRead0(Native Method)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:171)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.net.SocketInputStream.read(SocketInputStream.java:141)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.io.BufferedInputStream.read(BufferedInputStream.java:265)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.io.DataInputStream.readInt(DataInputStream.java:387)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:47,139
[myid:6] - WARN [RecvWorker:9:QuorumCnxManager$RecvWorker@1028] - Interrupting
SendWorker
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:47,140
[myid:6] - WARN [SendWorker:9:QuorumCnxManager$SendWorker@941] - Interrupted
while waiting for message on queue
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]:
java.lang.InterruptedException
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: #011at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)
Apr 12 16:33:47 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:47,140
[myid:6] - WARN [SendWorker:9:QuorumCnxManager$SendWorker@951] - Send worker
leaving thread
Apr 12 16:33:58 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:58,373
[myid:6] - INFO [/10.3.50.250:3888:QuorumCnxManager$Listener@743] - Received
connection request /10.3.50.66:35984
Apr 12 16:33:58 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:58,380
[myid:6] - INFO [LearnerHandler-/10.3.50.66:40180:LearnerHandler@346] -
Follower sid: 9 : info :
org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@24db1c2b
Apr 12 16:33:58 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:58,381
[myid:6] - INFO [LearnerHandler-/10.3.50.66:40180:LearnerHandler@401] -
Synchronizing with Follower sid: 9 maxCommittedLog=0x9000134b2
minCommittedLog=0x9000132be peerLastZxid=0x9000126d5
Apr 12 16:33:58 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:58,381
[myid:6] - WARN [LearnerHandler-/10.3.50.66:40180:LearnerHandler@468] -
Unhandled proposal scenario
Apr 12 16:33:58 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:58,381
[myid:6] - INFO [LearnerHandler-/10.3.50.66:40180:LearnerHandler@475] -
Sending SNAP
Apr 12 16:33:58 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:58,381
[myid:6] - INFO [LearnerHandler-/10.3.50.66:40180:LearnerHandler@499] -
Sending snapshot last zxid of peer is 0x9000126d5 zxid of leader is
0x9000134b2sent zxid of db as 0x9000134b2
Apr 12 16:33:58 prod-zk-voter-aza2 zookeeper[1129]: 2019-04-12 16:33:58,519
[myid:6] - INFO [LearnerHandler-/10.3.50.66:40180:LearnerHandler@535] -
Received NEWLEADER-ACK message from 9
{noformat}
I have no idea where it would get a roughly 900-second timeout from. If I
understand the flow correctly, the timeout should have been 4 seconds
(syncLimit = 2 times tickTime = 2000 ms); see the sketch after the config block
below.
{noformat}
clientPort=2181
dataDir=/zookeeper
tickTime=2000
initLimit=5
syncLimit=2
peerType=observer
server.1=10.3.52.83:2888:3888
server.2=10.3.52.245:2888:3888
server.3=10.3.51.102:2888:3888
server.4=10.3.51.53:2888:3888
server.5=10.3.50.133:2888:3888
server.6=10.3.50.250:2888:3888
server.7=10.3.50.62:2888:3888
server.8=10.3.50.128:2888:3888:observer
server.9=10.3.50.66:2888:3888:observer
{noformat}
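For reference, here is a minimal, standalone sketch of where I believe that
4-second figure comes from: as far as I can tell from reading the 3.4.x source,
LearnerHandler sets the read timeout on the leader-to-learner socket to
tickTime * syncLimit once the learner has ACKed NEWLEADER. The class and helper
names below are mine, purely for illustration, and the constants are just our
config values copied in; only the arithmetic is the point.
{code:java}
import java.net.Socket;
import java.net.SocketException;

/**
 * Illustration only (not ZooKeeper source): how I understand the 3.4.x
 * LearnerHandler derives the read timeout on the leader->learner socket
 * after the learner has ACKed NEWLEADER.
 */
public class LearnerSocketTimeoutSketch {

    // Values copied from the zoo.cfg above.
    static final int TICK_TIME_MS = 2000;
    static final int SYNC_LIMIT = 2;

    // Hypothetical helper mirroring what I believe is effectively
    // sock.setSoTimeout(tickTime * syncLimit) in LearnerHandler.run().
    static void applyLearnerReadTimeout(Socket sock) throws SocketException {
        int timeoutMs = TICK_TIME_MS * SYNC_LIMIT; // 2000 ms * 2 = 4000 ms
        sock.setSoTimeout(timeoutMs);
    }

    public static void main(String[] args) {
        System.out.println("expected learner read timeout: "
                + (TICK_TIME_MS * SYNC_LIMIT) + " ms");
    }
}
{code}
With the config above that works out to 2000 ms * 2 = 4000 ms, nowhere near the
roughly 900 seconds we actually waited.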
> Unexpected exception in zookeeper
> ---------------------------------
>
> Key: ZOOKEEPER-3036
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3036
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.4.10
> Environment: 3 Zookeepers, 5 kafka servers
> Reporter: Oded
> Priority: Critical
>
> We got an issue with one of the ZooKeepers (the Leader), causing the entire
> Kafka cluster to fail:
> 2018-05-09 02:29:01,730 [myid:3] - ERROR
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@648] - Unexpected
> exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
> 2018-05-09 02:29:01,730 [myid:3] - WARN
> [LearnerHandler-/192.168.0.91:42490:LearnerHandler@661] - ******* GOODBYE
> /192.168.0.91:42490 ********
>
> We would expect ZooKeeper to choose another Leader and the Kafka cluster to
> continue working as expected, but that was not the case.
>