Hi Developers,
I ran into a ZooKeeper-related OOME while running a Spring Boot + Dubbo +
ZooKeeper Java EE application; it occurs rarely. I captured a heap dump
(heap.bin) and analyzed it with Eclipse Memory Analyzer: there are almost 36
million nodes in ClientCnxn$EventThread.waitingEvents, and the nodes' item
fields all reference the same eventOfDeath object.
In ClientCnxn.EventThread, there is:
    private Object eventOfDeath = new Object();

    public void queueEventOfDeath() {
        waitingEvents.add(eventOfDeath);
    }
queueEventOfDeath() is the only place that adds eventOfDeath. It seems
queueEventOfDeath() is still invoked after wasKilled is already true, and the
addition is not filtered, so the sentinel keeps accumulating in the queue.
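To illustrate the suspected accumulation, below is a minimal, self-contained sketch (NOT the actual ZooKeeper source; the class name, the pending() helper, and the "guarded" variant are my own illustration): once the event thread has consumed the first sentinel and exited, every further queueEventOfDeath() call enqueues another reference to the same eventOfDeath object into a queue that nothing drains, whereas a wasKilled-style guard caps it at one.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for ClientCnxn$EventThread's queue handling,
// for illustration only; not the ZooKeeper 3.4.12 source.
public class EventOfDeathDemo {
    private final LinkedBlockingQueue<Object> waitingEvents = new LinkedBlockingQueue<>();
    private final Object eventOfDeath = new Object();
    private volatile boolean wasKilled = false;

    // Unfiltered, as in the snippet above: every call enqueues another
    // reference to the same sentinel object.
    void queueEventOfDeathUnguarded() {
        waitingEvents.add(eventOfDeath);
    }

    // Guarded variant: after the first sentinel, further calls are dropped
    // instead of piling up behind a dead consumer.
    synchronized void queueEventOfDeathGuarded() {
        if (!wasKilled) {
            waitingEvents.add(eventOfDeath);
            wasKilled = true;
        }
    }

    // Helper so the queue depth can be observed from outside.
    int pending() {
        return waitingEvents.size();
    }

    public static void main(String[] args) {
        EventOfDeathDemo unguarded = new EventOfDeathDemo();
        for (int i = 0; i < 1000; i++) unguarded.queueEventOfDeathUnguarded();
        System.out.println("unguarded queue size: " + unguarded.pending()); // 1000

        EventOfDeathDemo guarded = new EventOfDeathDemo();
        for (int i = 0; i < 1000; i++) guarded.queueEventOfDeathGuarded();
        System.out.println("guarded queue size: " + guarded.pending()); // 1
    }
}
```

The heap dump below is consistent with this pattern: the retained set is dominated by LinkedBlockingQueue$Node instances whose item is the one 16-byte java.lang.Object at 0xc82d2b48.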
The ZooKeeper version is 3.4.12, for both server and client.
The application that hit the OOME is a high-level service; there is also a
low-level service application that the high-level service depends on. Each is
a single deployment with no duplicate instance, and they were deployed at
almost the same time.
I searched the Jira issues but did not find a matching one. Any help is appreciated.
Attachments:
==
Some MAT result:
1) From dominator_tree:
Class Name | Shallow Heap | Retained Heap | Percentage
-------------------------------------------------------------------------------------------------------------------------------------------
org.apache.curator.CuratorZookeeperClient @ 0xc4312678 | 40 | 868,942,136 | 88.54%
|- org.apache.curator.ConnectionState @ 0xc43126a0 | 64 | 868,941,928 | 88.54%
| |- org.apache.curator.HandleHolder @ 0xc43126e0 | 40 | 868,941,080 | 88.54%
| | |- org.apache.curator.HandleHolder$1$1 @ 0xc82cfc60 | 16 | 868,941,016 | 88.54%
| | | '- org.apache.curator.HandleHolder$1 @ 0xc82cfc70 | 24 | 868,941,000 | 88.54%
| | |   '- org.apache.zookeeper.ZooKeeper @ 0xc82cfc88 | 24 | 868,940,976 | 88.54%
| | |     |- org.apache.zookeeper.ClientCnxn @ 0xc82cfca0 | 104 | 868,940,680 | 88.54%
| | |     | |- org.apache.zookeeper.ClientCnxn$EventThread @ 0xc82d2800 main-EventThread | 144 | 868,872,040 | 88.53%
| | |     | | |- java.util.concurrent.LinkedBlockingQueue @ 0xc82d29e0 | 48 | 868,871,816 | 88.53%
-------------------------------------------------------------------------------------------------------------------------------------------
2) ClientCnxn$EventThread -> list objects -> outgoing:
Class Name | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------------------------------------------------
org.apache.zookeeper.ClientCnxn$EventThread @ 0xc82d2800 main-EventThread | 144 | 868,872,040
|- waitingEvents java.util.concurrent.LinkedBlockingQueue @ 0xc82d29e0 | 48 | 868,871,816
| |- count java.util.concurrent.atomic.AtomicInteger @ 0xc82d2a10 36202972 | 16 | 16
| |- head java.util.concurrent.LinkedBlockingQueue$Node @ 0xc82d2a20 | 24 | 868,871,552
| | |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0xc82d2a38 | 24 | 868,871,528
| | | |- item org.apache.zookeeper.ClientCnxn$WatcherSetEventPair @ 0xc82d2a50 | 24 | 224
| | | |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0xc82d2b30 | 24 | 868,871,280
| | | | |- item java.lang.Object @ 0xc82d2b48 | 16 | 16
| | | | |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0xc82d2b58 | 24 | 868,871,256
| | | | | |- item java.lang.Object @ 0xc82d2b48 | 16 | 16
| | | | | |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0xc82d2b70 | 24 | 868,871,232
| | | | | | |- item java.lang.Object @ 0xc82d2b48 | 16 | 16
| | | | | | |- next java.util.concurrent.LinkedBlockingQueue$Node @ 0xc82d2b88 | 24 | 868,871,208
-----------------------------------------------------------------------------------------------------------------------------------------
3) Some ClientCnxn instance fields:
Class Name | Shallow Heap | Retained Heap
------------------------------------------------------------------------------------------------------------------------------------------
org.apache.zookeeper.ClientCnxn @ 0xc82cfca0 | 104 | 868,940,680
|- <class> class org.apache.zookeeper.ClientCnxn @ 0xc44deaa8 | 24 | 128
|- eventThread org.apache.zookeeper.ClientCnxn$EventThread @ 0xc82d2800 main-EventThread | 144 | 868,872,040
|- eventOfDeath java.lang.Object @ 0xc82d2b48 | 16 | 16
------------------------------------------------------------------------------------------------------------------------------------------
4) Count how many queue nodes reference eventOfDeath (OQL):

SELECT n.item, n.item.@objectId, n.item.hashCode() FROM java.util.concurrent.LinkedBlockingQueue$Node n

n.item | n.item.@objectId | n.item.hashCode()
-----------------------------------------------------------------------
java.lang.Object [id=0xc82d2b48] | 2,516,148 | 2,516,148
-----------------------------------------------------------------------

SELECT n.item FROM java.util.concurrent.LinkedBlockingQueue$Node n WHERE (n.item.@objectId = 2516148)

n.item
--------------------------------
java.lang.Object [id=0xc82d2b48]
--------------------------------
Total: 20 of 36,201,971 entries
==
Some of the business application log:
2019-08-07 16:04:22 [WARN] [.154:2181)] [o.a.z.ClientCnxn][1111]] Client session timed out, have not heard from server in 69261ms for sessionid 0x100000048ef0156
2019-08-07 16:04:22 [INFO] [.154:2181)] [o.a.z.ClientCnxn][1159]] Client session timed out, have not heard from server in 69261ms for sessionid 0x100000048ef0156, closing socket connection and attempting reconnect
2019-08-07 16:04:33 [INFO] [.154:2181)] [o.a.z.ClientCnxn][1028]] Opening socket connection to server 10.111.196.154/10.111.196.154:2181. Will not attempt to authenticate using SASL (unknown error)
2019-08-07 16:04:33 [INFO] [.154:2181)] [o.a.z.ClientCnxn][878]] Socket connection established to 10.111.196.154/10.111.196.154:2181, initiating session
2019-08-07 16:04:39 [WARN] [.154:2181)] [o.a.z.ClientCnxn][1288]] Unable to reconnect to ZooKeeper service, session 0x100000048ef0156 has expired
2019-08-07 16:04:39 [INFO] [.154:2181)] [o.a.z.ClientCnxn][1157]] Unable to reconnect to ZooKeeper service, session 0x100000048ef0156 has expired, closing socket connection
...
2019-08-07 16:04:44 [ERROR] [ventThread] [o.a.c.f.i.CuratorFrameworkImpl][685]] Watcher exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /dubbo/mycom.auth.api.UserAccountService/providers
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1653)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231)
    at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41)
    at com.alibaba.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient$CuratorWatcherImpl.process(CuratorZookeeperClient.java:159)
    at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
...
2019-08-07 16:04:49 [ERROR] [ventThread] [o.a.z.ClientCnxn][534]] Error while calling watcher
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at com.alibaba.dubbo.registry.support.AbstractRegistry.notify(AbstractRegistry.java:380)
    at com.alibaba.dubbo.registry.support.FailbackRegistry.doNotify(FailbackRegistry.java:273)
    at com.alibaba.dubbo.registry.support.FailbackRegistry.notify(FailbackRegistry.java:259)
    at com.alibaba.dubbo.registry.zookeeper.ZookeeperRegistry.access$400(ZookeeperRegistry.java:43)
    at com.alibaba.dubbo.registry.zookeeper.ZookeeperRegistry$3.childChanged(ZookeeperRegistry.java:171)
    at com.alibaba.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient$CuratorWatcherImpl.process(CuratorZookeeperClient.java:159)
    at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
==
Some of the ZooKeeper server log (/zookeeper.out):
2019-08-07 16:05:21,434 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@653] - Got user-level KeeperException when processing sessionid:0x100000048ef015a type:create cxid:0x1a zxid:0x14fa1 txntype:-1 reqpath:n/a Error Path:/dubbo/mycom.auth.api.EmailInfoService/routers Error:KeeperErrorCode = NodeExists for /dubbo/mycom.auth.api.EmailInfoService/routers
2019-08-07 16:05:31,024 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x100000048ef015a, likely client has closed socket
2019-08-07 16:05:31,025 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for client /10.111.196.155:45044 which had sessionid 0x100000048ef015a
2019-08-07 16:05:49,251 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /10.111.196.155:45150
2019-08-07 16:05:49,262 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@948] - Client attempting to establish new session at /10.111.196.155:45150
2019-08-07 16:05:49,264 [myid:] - INFO [SyncThread:0:ZooKeeperServer@693] - Established session 0x100000048ef015b with negotiated timeout 60000 for client /10.111.196.155:45150
2019-08-07 16:05:49,377 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@653] - Got user-level KeeperException when processing sessionid:0x100000048ef015b type:create cxid:0x4 zxid:0x14fa4 txntype:-1 reqpath:n/a Error Path:/dubbo/mycom.auth.api.UserAccountService/providers Error:KeeperErrorCode = NodeExists for /dubbo/mycom.auth.api.UserAccountService/providers
2019-08-07 16:05:49,397 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor@653] - Got user-level KeeperException when processing sessionid:0x100000048ef015b type:create cxid:0x7 zxid:0x14fa5 txntype:-1 reqpath:n/a Error Path:/dubbo/mycom.auth.api.UserAccountService/configurators Error:KeeperErrorCode = NodeExists for /dubbo/mycom.auth.api.UserAccountService/configurators