[
https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402819#comment-15402819
]
Artem Golotin edited comment on ZOOKEEPER-1863 at 8/1/16 10:26 PM:
-------------------------------------------------------------------
Hi! I was able to reproduce the same issue as mentioned in
[ZOOKEEPER-2151|https://issues.apache.org/jira/browse/ZOOKEEPER-2151]. Here is
the output from the stat command:
{noformat}
Zookeeper version: 3.5.1-alpha--1, built on 07/08/2016 17:08 GMT
Clients:
/0:0:0:0:0:0:0:1:59312[0](queued=0,recved=1,sent=0)
/10.2.29.7:34266[0](queued=0,recved=1,sent=0)
/10.2.29.7:55996[0](queued=0,recved=1,sent=0)
/10.2.81.3:48640[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/1/826
Received: 2199007
Sent: 935511
Connections: 4
Outstanding: 48112
Zxid: 0x100011d4e
Mode: leader
Node count: 171
{noformat}
Notice the number of outstanding connections....
After examining the code I found that the patch that have been submitted with
this issue was applied to the code. With my limited experience I do not believe
that I will be able to reproduce the issue, but I am saving the full stdout
zookeeper output and can provide it if needed.
Here's jstack dump:
{noformat}
Full thread dump OpenJDK 64-Bit Server VM (24.95-b01 mixed mode):
"NIOWorkerThread-4" daemon prio=10 tid=0x00007fbb9c007000 nid=0x1a7 waiting on
condition [0x00007fbbd84ce000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c18ed460> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"NIOWorkerThread-3" daemon prio=10 tid=0x00007fbb9c005000 nid=0x1a6 waiting on
condition [0x00007fbbd85cf000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c18ed460> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"NIOWorkerThread-2" daemon prio=10 tid=0x00007fbb9c003800 nid=0x1a4 waiting on
condition [0x00007fbbd87d1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c18ed460> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"ContainerManagerTask" daemon prio=10 tid=0x00007fbbb801b800 nid=0x1a3 in
Object.wait() [0x00007fbbd88d2000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at java.util.TimerThread.mainLoop(Timer.java:526)
- locked <0x00000000c190fa68> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:505)
"SyncThread:1" prio=10 tid=0x00007fbbb8016000 nid=0x1a1 in Object.wait()
[0x00007fbbd8ad4000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
- locked <0x00000000c195c128> (a
org.apache.zookeeper.server.SyncRequestProcessor)
at java.lang.Thread.join(Thread.java:1355)
at
org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:195)
at
org.apache.zookeeper.server.quorum.ProposalRequestProcessor.shutdown(ProposalRequestProcessor.java:90)
at
org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:954)
at
org.apache.zookeeper.server.quorum.LeaderRequestProcessor.shutdown(LeaderRequestProcessor.java:78)
at
org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:476)
- locked <0x00000000c18ebd28> (a
org.apache.zookeeper.server.quorum.LeaderZooKeeperServer)
at
org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.shutdown(LeaderZooKeeperServer.java:102)
- locked <0x00000000c18ebd28> (a
org.apache.zookeeper.server.quorum.LeaderZooKeeperServer)
at
org.apache.zookeeper.server.ZooKeeperServer$ZooKeeperServerListenerImpl.notifyStopping(ZooKeeperServer.java:443)
at
org.apache.zookeeper.server.ZooKeeperCriticalThread.handleException(ZooKeeperCriticalThread.java:49)
at
org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:165)
"NIOWorkerThread-1" daemon prio=10 tid=0x00007fbb9c002000 nid=0x1a0 waiting on
condition [0x00007fbbd8bd5000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c18ed460> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
"LearnerCnxAcceptor-/10.2.88.8:2888" prio=10 tid=0x00007fbbb8007000 nid=0x19d
runnable [0x00007fbbd8ed8000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at
org.apache.zookeeper.server.quorum.Leader$LearnerCnxAcceptor.run(Leader.java:364)
"QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled)" prio=10
tid=0x00007fbbe02fd800 nid=0x19c in Object.wait() [0x00007fbbd8fd9000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:559)
- locked <0x00000000c18ebe20> (a
org.apache.zookeeper.server.quorum.Leader)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1036)
"WorkerReceiver[myid=1]" daemon prio=10 tid=0x00007fbbe02fb800 nid=0x19b
waiting on condition [0x00007fbbd90da000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c1820688> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollRecvQueue(QuorumCnxManager.java:1038)
at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:235)
at java.lang.Thread.run(Thread.java:745)
"WorkerSender[myid=1]" daemon prio=10 tid=0x00007fbbe02fa800 nid=0x19a waiting
on condition [0x00007fbbd91db000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c1820e98> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at
org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:457)
at java.lang.Thread.run(Thread.java:745)
"/10.2.88.8:3888" prio=10 tid=0x00007fbbe02e9000 nid=0x199 runnable
[0x00007fbbd92dc000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:635)
"2029193347@qtp-381175311-1 - Acceptor0 [email protected]:8080" prio=10
tid=0x00007fbbe02d9000 nid=0x198 runnable [0x00007fbbd93dd000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:99)
at
org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
"1451939056@qtp-381175311-0" prio=10 tid=0x00007fbbe02dd800 nid=0x197 in
Object.wait() [0x00007fbbd94de000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
- locked <0x00000000c1821e40> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)
"ConnnectionExpirer" prio=10 tid=0x00007fbbe02d7800 nid=0x196 waiting on
condition [0x00007fbbd95df000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
org.apache.zookeeper.server.NIOServerCnxnFactory$ConnectionExpirerThread.run(NIOServerCnxnFactory.java:574)
"NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181" daemon prio=10
tid=0x00007fbbe02d6000 nid=0x195 runnable [0x00007fbbd96e0000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000c19609b0> (a sun.nio.ch.Util$2)
- locked <0x00000000c19609a0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000c19607a8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
at
org.apache.zookeeper.server.NIOServerCnxnFactory$AcceptThread.select(NIOServerCnxnFactory.java:229)
at
org.apache.zookeeper.server.NIOServerCnxnFactory$AcceptThread.run(NIOServerCnxnFactory.java:205)
"NIOServerCxnFactory.SelectorThread-0" daemon prio=10 tid=0x00007fbbe02d5800
nid=0x194 runnable [0x00007fbbd97e1000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000c19479e8> (a sun.nio.ch.Util$2)
- locked <0x00000000c19479d8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000c1947350> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
at
org.apache.zookeeper.server.NIOServerCnxnFactory$SelectorThread.select(NIOServerCnxnFactory.java:426)
at
org.apache.zookeeper.server.NIOServerCnxnFactory$SelectorThread.run(NIOServerCnxnFactory.java:391)
"PurgeTask" daemon prio=10 tid=0x00007fbbe024d000 nid=0x193 in Object.wait()
[0x00007fbbd98e2000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.util.TimerThread.mainLoop(Timer.java:552)
- locked <0x00000000c1822748> (a java.util.TaskQueue)
at java.util.TimerThread.run(Timer.java:505)
"RMI TCP Accept-0" daemon prio=10 tid=0x00007fbbe01ed000 nid=0x191 runnable
[0x00007fbbe4168000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:399)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:371)
at java.lang.Thread.run(Thread.java:745)
"Service Thread" daemon prio=10 tid=0x00007fbbe00ab000 nid=0x190 runnable
[0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x00007fbbe00a9000 nid=0x18f waiting on
condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x00007fbbe00a6000 nid=0x18e waiting on
condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00007fbbe00a3800 nid=0x18d waiting on
condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00007fbbe0078000 nid=0x18c in Object.wait()
[0x00007fbbe4ea8000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x00000000c1861290> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
"Reference Handler" daemon prio=10 tid=0x00007fbbe0076000 nid=0x18b in
Object.wait() [0x00007fbbe4fa9000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x00000000c1860ec0> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x00007fbbe000a800 nid=0x187 in Object.wait()
[0x00007fbbe9a1a000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000c18ebec0> (a
org.apache.zookeeper.server.quorum.QuorumPeer)
at java.lang.Thread.join(Thread.java:1281)
- locked <0x00000000c18ebec0> (a
org.apache.zookeeper.server.quorum.QuorumPeer)
at java.lang.Thread.join(Thread.java:1355)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:184)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:120)
at
org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
"VM Thread" prio=10 tid=0x00007fbbe0071800 nid=0x18a runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fbbe0020800 nid=0x188
runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fbbe0022800 nid=0x189
runnable
"VM Periodic Task Thread" prio=10 tid=0x00007fbbe01fa800 nid=0x192 waiting on
condition
JNI global references: 149
Heap
PSYoungGen total 26112K, used 8888K [0x00000000eb280000,
0x00000000ed880000, 0x0000000100000000)
eden space 13312K, 45% used
[0x00000000eb280000,0x00000000eb86e050,0x00000000ebf80000)
from space 12800K, 22% used
[0x00000000ecc00000,0x00000000ecec0000,0x00000000ed880000)
to space 12800K, 0% used
[0x00000000ebf80000,0x00000000ebf80000,0x00000000ecc00000)
ParOldGen total 278528K, used 266898K [0x00000000c1800000,
0x00000000d2800000, 0x00000000eb280000)
object space 278528K, 95% used
[0x00000000c1800000,0x00000000d1ca48e8,0x00000000d2800000)
PSPermGen total 21504K, used 14831K [0x00000000b7200000,
0x00000000b8700000, 0x00000000c1800000)
object space 21504K, 68% used
[0x00000000b7200000,0x00000000b807bd50,0x00000000b8700000)
{noformat}
Unfortunately, the issue seems to have happened on Sunday when I was not able
to check the health of my cluster or perform any immediate fixes. I do have
several Kazoo clients + Kafka + Secor that are using zookeeper, and as all
those are failing now with zookeeper outputting the following logs:
{noformat}
2016-08-01 20:45:51,828 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50656
2016-08-01 20:45:51,829 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50656
2016-08-01 20:45:53,723 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35862
2016-08-01 20:45:53,725 - INFO [NIOWorkerThread-3:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35862
2016-08-01 20:45:59,171 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50658
2016-08-01 20:45:59,172 - INFO [NIOWorkerThread-1:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50658
2016-08-01 20:46:00,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35828 which had sessionid
0x10044f682c45bdd
2016-08-01 20:46:00,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50656 which had sessionid
0x10044f682c45bdf
2016-08-01 20:46:05,013 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35898
2016-08-01 20:46:05,028 - INFO [NIOWorkerThread-2:NIOServerCnxn@836] -
Processing ruok command from /10.2.29.7:35898
2016-08-01 20:46:05,029 - INFO [NIOWorkerThread-2:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35898 (no session established for
client)
2016-08-01 20:46:05,146 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO [NIOWorkerThread-4:NIOServerCnxn@836] -
Processing srvr command from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO [NIOWorkerThread-4:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35900 (no session established for
client)
2016-08-01 20:46:05,617 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35904
2016-08-01 20:46:05,617 - INFO [NIOWorkerThread-3:NIOServerCnxn@836] -
Processing ruok command from /10.2.29.7:35904
2016-08-01 20:46:05,618 - INFO [NIOWorkerThread-3:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35904 (no session established for
client)
2016-08-01 20:46:05,634 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35906
2016-08-01 20:46:05,634 - INFO [NIOWorkerThread-1:NIOServerCnxn@836] -
Processing srvr command from /10.2.29.7:35906
2016-08-01 20:46:05,636 - INFO [NIOWorkerThread-1:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35906 (no session established for
client)
2016-08-01 20:46:05,644 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35908
2016-08-01 20:46:05,644 - INFO [NIOWorkerThread-2:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35908
2016-08-01 20:46:06,718 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50660
2016-08-01 20:46:06,719 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50660
2016-08-01 20:46:10,224 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50658 which had sessionid
0x10044f682c45be1
2016-08-01 20:46:10,225 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35862 which had sessionid
0x10044f682c45be0
2016-08-01 20:46:13,933 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50662
2016-08-01 20:46:13,933 - INFO [NIOWorkerThread-3:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50662
2016-08-01 20:46:16,030 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35942
2016-08-01 20:46:16,032 - INFO [NIOWorkerThread-1:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35942
2016-08-01 20:46:18,646 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:57696
2016-08-01 20:46:18,647 - INFO [NIOWorkerThread-2:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:57696
2016-08-01 20:46:20,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35908 which had sessionid
0x10044f682c45be2
2016-08-01 20:46:20,224 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50662 which had sessionid
0x10044f682c45be4
2016-08-01 20:46:20,225 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50660 which had sessionid
0x10044f682c45be3
2016-08-01 20:46:21,255 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50664
2016-08-01 20:46:21,256 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50664
{noformat}
Also, I would like to point out that I am running zookeeper as a part of
kubernetes cluster, so the issue may be caused by kubernetes internal proxy,
but I do not know why that would cause an issue...
was (Author: agolotin):
Hi! I was able to reproduce the same issue as mentioned in
[ZOOKEEPER-2151|https://issues.apache.org/jira/browse/ZOOKEEPER-2151]. Here is
the output from the stat command:
{noformat}
Zookeeper version: 3.5.1-alpha--1, built on 07/08/2016 17:08 GMT
Clients:
/0:0:0:0:0:0:0:1:59312[0](queued=0,recved=1,sent=0)
/10.2.29.7:34266[0](queued=0,recved=1,sent=0)
/10.2.29.7:55996[0](queued=0,recved=1,sent=0)
/10.2.81.3:48640[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/1/826
Received: 2199007
Sent: 935511
Connections: 4
Outstanding: 48112
Zxid: 0x100011d4e
Mode: leader
Node count: 171
{noformat}
Notice the number of outstanding connections....
After examining the code I found that the patch that have been submitted with
this issue was applied to the code. With my limited experience I do not believe
that I will be able to reproduce the issue, but I am saving the full stdout
zookeeper output and can provide it if needed. However, I cannot provide you
with a jstack dump, because when I try to do so (even with -F option), it
generates the following exception:
{noformat}
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't
attach to the process
{noformat}
Unfortunately, the issue seems to have happened on Sunday when I was not able
to check the health of my cluster or perform any immediate fixes. I do have
several Kazoo clients + Kafka + Secor that are using zookeeper, and as all
those are failing now with zookeeper outputting the following logs:
{noformat}
2016-08-01 20:45:51,828 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50656
2016-08-01 20:45:51,829 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50656
2016-08-01 20:45:53,723 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35862
2016-08-01 20:45:53,725 - INFO [NIOWorkerThread-3:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35862
2016-08-01 20:45:59,171 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50658
2016-08-01 20:45:59,172 - INFO [NIOWorkerThread-1:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50658
2016-08-01 20:46:00,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35828 which had sessionid
0x10044f682c45bdd
2016-08-01 20:46:00,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50656 which had sessionid
0x10044f682c45bdf
2016-08-01 20:46:05,013 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35898
2016-08-01 20:46:05,028 - INFO [NIOWorkerThread-2:NIOServerCnxn@836] -
Processing ruok command from /10.2.29.7:35898
2016-08-01 20:46:05,029 - INFO [NIOWorkerThread-2:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35898 (no session established for
client)
2016-08-01 20:46:05,146 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO [NIOWorkerThread-4:NIOServerCnxn@836] -
Processing srvr command from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO [NIOWorkerThread-4:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35900 (no session established for
client)
2016-08-01 20:46:05,617 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35904
2016-08-01 20:46:05,617 - INFO [NIOWorkerThread-3:NIOServerCnxn@836] -
Processing ruok command from /10.2.29.7:35904
2016-08-01 20:46:05,618 - INFO [NIOWorkerThread-3:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35904 (no session established for
client)
2016-08-01 20:46:05,634 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35906
2016-08-01 20:46:05,634 - INFO [NIOWorkerThread-1:NIOServerCnxn@836] -
Processing srvr command from /10.2.29.7:35906
2016-08-01 20:46:05,636 - INFO [NIOWorkerThread-1:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35906 (no session established for
client)
2016-08-01 20:46:05,644 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35908
2016-08-01 20:46:05,644 - INFO [NIOWorkerThread-2:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35908
2016-08-01 20:46:06,718 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50660
2016-08-01 20:46:06,719 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50660
2016-08-01 20:46:10,224 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50658 which had sessionid
0x10044f682c45be1
2016-08-01 20:46:10,225 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35862 which had sessionid
0x10044f682c45be0
2016-08-01 20:46:13,933 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50662
2016-08-01 20:46:13,933 - INFO [NIOWorkerThread-3:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50662
2016-08-01 20:46:16,030 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35942
2016-08-01 20:46:16,032 - INFO [NIOWorkerThread-1:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35942
2016-08-01 20:46:18,646 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:57696
2016-08-01 20:46:18,647 - INFO [NIOWorkerThread-2:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:57696
2016-08-01 20:46:20,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35908 which had sessionid
0x10044f682c45be2
2016-08-01 20:46:20,224 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50662 which had sessionid
0x10044f682c45be4
2016-08-01 20:46:20,225 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50660 which had sessionid
0x10044f682c45be3
2016-08-01 20:46:21,255 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50664
2016-08-01 20:46:21,256 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50664
{noformat}
Also, I would like to point out that I am running zookeeper as a part of
kubernetes cluster, so the issue may be caused by kubernetes internal proxy,
but I do not know why that would cause an issue...
> Race condition in commit processor leading to out of order request
> completion, xid mismatch on client.
> ------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1863
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0
> Reporter: Dutch T. Meyer
> Assignee: Dutch T. Meyer
> Priority: Blocker
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch,
> ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch,
> ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, stack.17512
>
>
> In CommitProcessor.java processor, if we are at the primary request handler
> on line 167:
> {noformat}
> while (!stopped && !isWaitingForCommit() &&
> !isProcessingCommit() &&
> (request = queuedRequests.poll()) != null) {
> if (needCommit(request)) {
> nextPending.set(request);
> } else {
> sendToNextProcessor(request);
> }
> }
> {noformat}
> A request can be handled in this block and be quickly processed and completed
> on another thread. If queuedRequests is empty, we then exit the block. Next,
> before this thread makes any more progress, we can get 2 more requests, one
> get_children(say), and a sync placed on queuedRequests for the processor.
> Then, if we are very unlucky, the sync request can complete and this object's
> commit() routine is called (from FollowerZookeeperServer), which places the
> sync request on the previously empty committedRequests queue. At that point,
> this thread continues.
> We reach line 182, which is a check on sync requests.
> {noformat}
> if (!stopped && !isProcessingRequest() &&
> (request = committedRequests.poll()) != null) {
> {noformat}
> Here we are not processing any requests, because the original request has
> completed. We haven't dequeued either the read or the sync request in this
> processor. Next, the poll above will pull the sync request off the queue, and
> in the following block, the sync will get forwarded to the next processor.
> This is a problem because the read request hasn't been forwarded yet, so
> requests are now out of order.
> I've been able to reproduce this bug reliably by injecting a
> Thread.sleep(5000) between the two blocks above to make the race condition
> far more likely, then in a client program.
> {noformat}
> zoo_aget_children(zh, "/", 0, getchildren_cb, NULL);
> //Wait long enough for queuedRequests to drain
> sleep(1);
> zoo_aget_children(zh, "/", 0, getchildren_cb, &th_ctx[0]);
> zoo_async(zh, "/", sync_cb, &th_ctx[0]);
> {noformat}
> When this bug is triggered, 3 things can happen:
> 1) Clients will see requests complete out of order and fail on xid mismatches.
> 2) Kazoo in particular doesn't handle this runtime exception well, and can
> orphan outstanding requests.
> 3) I've seen zookeeper servers deadlock, likely because the commit cannot be
> completed, which can wedge the commit processor.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)