pkuwm opened a new issue #962:
URL: https://github.com/apache/helix/issues/962


   **Problem**
   When ZkClient's `getChildren()` is called on a znode with a very large number of children, the response packet can exceed the client-side `jute.maxbuffer` limit (4 MB by default). The ZooKeeper client then closes the connection, ZkClient gets a `ConnectionLossException`, and it keeps reconnecting to ZK and retrying indefinitely. Because each retry fails the same way, the endless retries may cause heavy GC on the ZK server and eventually kill it.
   
   **Related Logs**
   Reproducing the exception with zkCli:
   ```
   Exception in thread "main" org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /CLUSTER-1/CONFIGS/RESOURCE
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1532)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1560)
        at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:731)
        at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:599)
        at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:371)
        at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:331)
   ```
   
   In application logs:
   ```
   2020/03/03 17:55:19.979 WARN [ClientCnxn] [HelixTaskExecutor-message_handle_thread-SendThread(localhost:2181)] [helix] [] Session 0x270a12ba21102df for server localhost/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
   java.io.IOException: Packet len4198500 is out of range!
           at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:113)
           at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
           at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
           at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
   
   ```
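
The `Packet len4198500 is out of range!` message is consistent with a simple length check on the client: 4198500 bytes is just over 4 MB (4096 * 1024 = 4194304). The following is a minimal, self-contained sketch of that kind of check, not the actual ZooKeeper source. The class and method names are hypothetical, and the fallback value assumes the 4 MB client default described in this issue.

```java
import java.io.IOException;

// Hypothetical stand-in for the client-side length check; not ZooKeeper code.
class PacketLengthCheck {
    // Assumed client default when the jute.maxbuffer system property is unset.
    static final int DEFAULT_MAX_BUFFER = 4096 * 1024;

    /** Reads the limit from the jute.maxbuffer system property, falling back to 4 MB. */
    static int maxPacketLen() {
        return Integer.getInteger("jute.maxbuffer", DEFAULT_MAX_BUFFER);
    }

    /** Rejects a response whose declared length is negative or not below the limit. */
    static void checkLength(int len, int limit) throws IOException {
        if (len < 0 || len >= limit) {
            throw new IOException("Packet len" + len + " is out of range!");
        }
    }

    public static void main(String[] args) {
        int limit = maxPacketLen();
        try {
            checkLength(4198500, limit); // the length reported in the application log
            System.out.println("accepted with limit " + limit);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Since the check depends only on the declared length and the configured limit, repeating the same `getChildren()` call against the same znode cannot succeed unless the limit is raised or the number of children shrinks.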
   
   **Related Code in Helix ZkClient**
   
https://github.com/apache/helix/blob/108dfc6c9b48a89b1c9804f0c2b77c2572d623a8/zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java#L1449-L1481
   
   **Potential Solutions**
   
   - Catch the specific error `Packet len4198500 is out of range!` in ZkClient and stop retrying. However, this error is not propagated to ZkClient; only a generic `ConnectionLossException` is thrown, so the native ZooKeeper code would have to be changed to surface the specific error for ZkClient to catch.
   - Add a retry policy to ZkClient so that reconnect retries are bounded by elapsed time or by number of attempts.
   - More...
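
To make the second option concrete, here is a minimal sketch of a retry loop bounded by both attempt count and elapsed time. It is not Helix's ZkClient code: `BoundedRetry`, `retry`, and all of its parameters are hypothetical names chosen for illustration, and the sketch catches any `Exception` to stay free of ZooKeeper dependencies, whereas a real implementation would presumably retry only on connection-related errors such as `ConnectionLossException`.

```java
import java.util.concurrent.Callable;

// Hypothetical helper illustrating a bounded retry policy; not part of Helix.
class BoundedRetry {
    /**
     * Runs op until it succeeds, maxRetries retries have been used,
     * or another backoff would pass the maxElapsedMs deadline.
     * Rethrows the last failure when giving up.
     */
    static <T> T retry(Callable<T> op, int maxRetries, long maxElapsedMs, long backoffMs)
            throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        long deadline = System.currentTimeMillis() + maxElapsedMs;
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and propagate.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e; // a real policy would check that e is retryable
            }
            // Stop if retries are exhausted or the next backoff would pass the deadline.
            if (attempt == maxRetries || System.currentTimeMillis() + backoffMs > deadline) {
                break;
            }
            Thread.sleep(backoffMs);
        }
        throw last;
    }
}
```

A caller could wrap the failing operation as `BoundedRetry.retry(() -> zk.getChildren(path, false), 5, 30_000L, 1_000L)`, where `zk` and `path` are placeholders. With limits like these, an oversized response fails fast with the last exception instead of retrying forever against the server.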

