pkuwm opened a new issue #962:
URL: https://github.com/apache/helix/issues/962
**Problem**
For ZkClient's `getChildren()` operation, if the znode has a large number of
children and the response packet size exceeds the `jute.maxbuffer` limit on the
ZK client side (4 MB by default), the client closes the socket and ZkClient
gets a `ConnectionLossException`. ZkClient then keeps retrying, reconnecting to
ZK each time. Because every retry fails the same way, the loop never ends, and
the repeated requests may cause heavy GC on the ZK server and kill it.
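To make the failing condition concrete, here is a minimal sketch of the kind of length check the ZK client applies to each incoming packet. It is a paraphrase, not the ZooKeeper source: the class and method names (`PacketLengthCheck`, `isOutOfRange`, `checkLength`) are hypothetical, and only the `jute.maxbuffer` property, the 4 MB default, and the error text come from this issue.

```java
import java.io.IOException;

// Hypothetical stand-in for the client-side check; not ZooKeeper code.
class PacketLengthCheck {
    // Client-side default when -Djute.maxbuffer is not set: 4096 * 1024 = 4194304 bytes.
    static final int DEFAULT_MAX_BUFFER = 4096 * 1024;

    // Reads the limit the same way a JVM system property would be consulted.
    static int maxBuffer() {
        return Integer.getInteger("jute.maxbuffer", DEFAULT_MAX_BUFFER);
    }

    // A packet is rejected when its declared length is negative or not below the limit.
    static boolean isOutOfRange(int len, int limit) {
        return len < 0 || len >= limit;
    }

    // Mirrors the error text seen in the application logs.
    static void checkLength(int len, int limit) throws IOException {
        if (isOutOfRange(len, limit)) {
            throw new IOException("Packet len" + len + " is out of range!");
        }
    }

    public static void main(String[] args) {
        int len = 4198500; // length reported in the application log
        try {
            checkLength(len, maxBuffer());
            System.out.println("accepted");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The logged length, 4198500 bytes, is only about 4 KB over the 4194304-byte default, so a children list does not have to grow far past the limit before every read fails.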
**Related Logs**
Reproducing the exception with zkCli:
```
Exception in thread "main" org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /CLUSTER-1/CONFIGS/RESOURCE
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1532)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1560)
    at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:731)
    at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:599)
    at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:371)
    at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:331)
```
In application logs:
```
2020/03/03 17:55:19.979 WARN [ClientCnxn] [HelixTaskExecutor-message_handle_thread-SendThread(localhost:2181)] [helix] [] Session 0x270a12ba21102df for server localhost/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len4198500 is out of range!
    at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
```
**Related Code in Helix ZkClient**
https://github.com/apache/helix/blob/108dfc6c9b48a89b1c9804f0c2b77c2572d623a8/zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java#L1449-L1481
**Potential Solutions**
- Catch the specific error `Packet len4198500 is out of range!` in ZkClient
and stop retrying. However, that `IOException` is not propagated to ZkClient;
only a general `ConnectionLossException` is thrown, so the native ZooKeeper
code would have to be changed to surface the error for ZkClient to catch.
- Add a retry policy to ZkClient so that retries can be limited by time or by
number of attempts.
- More...
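
The retry-policy option above could look roughly like the following. This is a minimal sketch under stated assumptions, not Helix's ZkClient code: `BoundedRetry` and its parameters are hypothetical names, and `java.io.IOException` stands in for `ConnectionLossException` so the example runs without ZooKeeper on the classpath.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper; not part of Helix or ZooKeeper.
class BoundedRetry {
    /**
     * Calls op until it succeeds, maxAttempts is reached, or the next
     * backoff would pass the time budget. Only IOException (a stand-in
     * for ConnectionLossException) is retried; anything else propagates.
     */
    static <T> T retry(Callable<T> op, int maxAttempts, long timeoutMs, long backoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long deadlineNanos = System.nanoTime() + timeoutMs * 1_000_000L;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e; // remember the failure and decide whether to retry
            }
            boolean outOfAttempts = attempt == maxAttempts;
            boolean outOfTime = System.nanoTime() + backoffMs * 1_000_000L > deadlineNanos;
            if (outOfAttempts || outOfTime) {
                break;
            }
            Thread.sleep(backoffMs);
        }
        throw last; // give up and surface the last failure to the caller
    }

    public static void main(String[] args) throws Exception {
        try {
            // Simulates a read that fails the same way on every attempt.
            retry(() -> {
                throw new IOException("Packet len4198500 is out of range!");
            }, 3, 5_000L, 100L);
        } catch (IOException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

With either bound in place, a failure that can never succeed, such as an oversized `getChildren()` response, ends in an exception for the caller instead of an endless reconnect loop.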