pkuwm commented on a change in pull request #1109:
URL: https://github.com/apache/helix/pull/1109#discussion_r445319010



##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,51 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+
     try {
       List<String> children = retryUntilConnected(new Callable<List<String>>() 
{
+        private int connectionLossRetryCount = 0;
+
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount;
+            // Allow retrying 3 times before checking stat checking number of 
children,
+            // because there is a higher possibility that connection loss is 
caused by other
+            // factors such as network connectivity, connected ZK node could 
not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of 
children.
+              // Infinitely retrying connecting may cause high GC in ZK server 
and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to 
exit retry loop.
+              // TODO: remove this check once we have a better way to exit 
infinite retry
+              Stat stat = getStat(path);
+              if (stat != null) {
+                if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+                  LOG.error("Failed to get children for path {} because number 
of children {} "
+                          + "exceeds limit {}, aborting retry.", path, 
stat.getNumChildren(),
+                      NUM_CHILDREN_LIMIT);
+                  // There is not an accurate KeeperException for the purpose.
+                  // MarshallingErrorException could represent transport error,
+                  // so use it to exit retry loop and tell that zk is not able 
to
+                  // transport the data because packet length is too large.
+                  throw new KeeperException.MarshallingErrorException();

Review comment:
       Adding @jiajunwang for extra opinions.
   I understand that and I also thought about it. My take is since this is only 
a workaround not for long term, I would like to just make it simple.
   If later we would like to make it nicer because ZK gives a different error 
code, we could have a sub ZkException like `ZkMarshallingError` (we already 
have this exception defined). 

##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -103,6 +109,13 @@
   // ZkEventThread. Otherwise the retry request might block the normal event 
processing.
   protected final ZkAsyncRetryThread _asyncCallRetryThread;
 
+  static {

Review comment:
       @jiajunwang No, the test will not work, because compilers optimize it 
inline:
   ```
   getChildren() {
     if (stat.getNumChildren() > 100 * 1000) { }
   }
   ```
   
   We would like to avoid that and keep the variable. So the static block is 
used.
   
   In the version 3.4.13 of zk we use, the `packetLen` also declares in this 
way.
   ```
   public static final int packetLen;
   
     static {
       if (LOG.isDebugEnabled()) {
         LOG.debug("zookeeper.disableAutoWatchReset is " + 
disableAutoWatchReset);
       }
   
       packetLen = Integer.getInteger("jute.maxbuffer", 4194304);
     }
   ```
   
   FYI, `packetLen` is removed in newer version of ZK.

##########
File path: 
zookeeper-api/src/test/java/org/apache/helix/zookeeper/impl/client/TestRawZkClient.java
##########
@@ -799,4 +802,66 @@ public void testAsyncWriteOperations() {
       zkClient.delete("/tmp/asyncOversize");
     }
   }
+
+  /*
+   * Tests getChildren() when there are an excessive number of children and 
connection loss happens,
+   * the operation should terminate and exit retry loop.
+   */
+  @Test
+  public void testGetChildrenOnLargeNumChildren() throws Exception {
+    // Default packetLen is 4M. It is static final and initialized

Review comment:
       I don't think it is necessary, because anyway it checks jute.maxbuffer. 
What you mean 10K children nodes are for 4 MB, it is still defined by 
jute.maxbuffer. We adjust this value to smaller one for testing so only 5 
children are needed. Either 100K or 5, it is behaving the same, why would we 
bother to create 10K children, right? :)  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to