pkuwm commented on a change in pull request #1109:
URL: https://github.com/apache/helix/pull/1109#discussion_r443816414



##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +1000,43 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+    // Need one element array to change value of this final variable.
+    final int[] connectionLossRetryCount = {0};
     try {
       List<String> children = retryUntilConnected(new Callable<List<String>>() 
{
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount[0];
+            // Allow retrying 3 times before checking stat checking number of 
children,
+            // because there is a higher possibility that connection loss is 
caused by other
+            // factors such as network connectivity, connected ZK node could 
not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount[0] >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of 
children.
+              // Infinitely retrying connecting may cause high GC in ZK server 
and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to 
exit retry loop.
+              // TODO: remove this check once we have a better way to exit 
infinite retry
+              Stat stat = getStat(path);
+              if (stat != null && stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+                LOG.error("Failed to get children for path {} because number 
of children {} "
+                        + "exceeds limit {}, aborting retry.", path, 
stat.getNumChildren(),
+                    NUM_CHILDREN_LIMIT);
+                // There is not an accurate KeeperException for the purpose.
+                // MarshallingErrorException could represent transport error,
+                // so use it to exit retry loop and tell that zk is not able to
+                // transport the data because packet length is too large.
+                throw new KeeperException.MarshallingErrorException();

Review comment:
       @lei-xia `KeeperException` is an abstract class and it is not flexible 
for us to extend to have a new exception code. Say if we create a new exception 
type 
   ```
   PacketOutOfLenException extends KeeperException {
      PacketOutOfLenException(code) {
         super(code);
      }
   }
   ```
   If we use a different code other than existing defined code, KeeperException 
won't be able to recognize it and will throw `IllegalArgumentException(Invalid 
exception code)`. I don't think this is what we need.
   
   Comparing all the exception types in KeeperException, I think 
`MarshallingErrorException` is the closest one that represents the failure 
reason: failed at transport layer.
   
   When ZkClient throws `MarshallingErrorException`, we'll have error log 
`"Failed to get children for path {} because number of children {} exceeds 
limit {}, aborting retry."` and know the error is because of large children 
listing.
   
   What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to