jiajunwang commented on a change in pull request #1109:
URL: https://github.com/apache/helix/pull/1109#discussion_r447317694



##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -1503,12 +1555,14 @@ private void acquireEventLock() {
         } catch (Exception e) {
           throw ExceptionUtil.convertToRuntimeException(e);
         }
+
         // before attempting a retry, check whether retry timeout has elapsed
        if (System.currentTimeMillis() - operationStartTime > _operationRetryTimeoutInMillis) {
          throw new ZkTimeoutException("Operation cannot be retried because of retry timeout ("
              + _operationRetryTimeoutInMillis + " milli seconds). Retry was caused by "
              + retryCauseCode);
         }
+        LOG.warn("Retrying operation, caused by {}", retryCauseCode);

Review comment:
       I think we removed this because many clients complained that this retry log floods their log files. Please make it debug-level or don't add it.

##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,50 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+
     try {
      List<String> children = retryUntilConnected(new Callable<List<String>>() {
+        private int connectionLossRetryCount = 0;
+
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount;
+            // Allow retrying 3 times before checking stat checking number of children,
+            // because there is a higher possibility that connection loss is caused by other
+            // factors such as network connectivity, connected ZK node could not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of children.
+              // Infinitely retrying connecting may cause high GC in ZK server and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to exit retry loop.
+              // TODO: remove this check once we have a better way to exit infinite retry
+              Stat stat = getStat(path);

Review comment:
       This is a recursive retryUntilConnected call inside. I think it shall work, but it would be better if we could avoid the recursion. Basically what you need to do is:
   1. Wait until connected.
   2. Call the native exists() call to get the Stat.
   
   The benefit is mainly avoiding potential issues from having a retryUntilConnected inside another retryUntilConnected.
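   The two-step alternative above could look roughly like this. This is a hedged, self-contained sketch: `Connection`, `isConnected()`, and `numChildren()` are hypothetical stand-ins for the real ZkClient/ZooKeeper session API, not the actual methods.
   
   ```java
   // Sketch only: models waiting for the session and then issuing a direct,
   // non-retrying stat-style call, instead of nesting a retryUntilConnected
   // call inside the retry loop's callable.
   public class DirectStatCheck {
     interface Connection {
       boolean isConnected();
       int numChildren(String path); // stand-in for the native exists() call returning a Stat
     }
   
     static int numChildrenDirect(Connection conn, String path) {
       // Step 1: block until the session is usable.
       while (!conn.isConnected()) {
         Thread.onSpinWait();
       }
       // Step 2: single direct call, no nested retry loop.
       return conn.numChildren(path);
     }
   }
   ```
   
   Because the stat lookup never re-enters the retry machinery, there is no retry loop nested inside another retry loop.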

##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,50 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+
     try {
      List<String> children = retryUntilConnected(new Callable<List<String>>() {
+        private int connectionLossRetryCount = 0;
+
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount;
+            // Allow retrying 3 times before checking stat checking number of children,
+            // because there is a higher possibility that connection loss is caused by other
+            // factors such as network connectivity, connected ZK node could not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of children.
+              // Infinitely retrying connecting may cause high GC in ZK server and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to exit retry loop.
+              // TODO: remove this check once we have a better way to exit infinite retry
+              Stat stat = getStat(path);
+              if (stat != null) {
+                if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+                  LOG.error("Failed to get children for path {} because number of children {} "
+                          + "exceeds limit {}, aborting retry.", path, stat.getNumChildren(),
+                      NUM_CHILDREN_LIMIT);
+                  // MarshallingErrorException could represent transport error: exceeding the
+                  // Jute buffer size. So use it to exit retry loop and tell that zk is not able to
+                  // transport the data because packet length is too large.
+                  throw new KeeperException.MarshallingErrorException();
+                } else {
+                  // No need to do stat again for next connection loss.
+                  connectionLossRetryCount = 0;

Review comment:
       I think you might want to keep checking the number of children on every retry after the first 3. Otherwise, what if the number of children increases during the retries?
   
   So just make the logic like:
   if (count < 3 || getChildrenCount() < limit) {
     count++;
     retry;
   } else {
     exit;
   }
   
   
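   The suggested predicate above could be sketched like this. It is a minimal, self-contained illustration under assumptions: `shouldRetry` and its inputs are hypothetical names, and the `NUM_CHILDREN_LIMIT` value here is a placeholder, not the constant from the patch.
   
   ```java
   // Sketch of the reviewer's pseudocode: the child-count check stays active on
   // every retry after the first three, so a node whose children keep growing
   // during the retries is still caught.
   public class RetryDecision {
     static final int NUM_CHILDREN_LIMIT = 100_000; // placeholder value
   
     // Returns true if another retry is allowed after a connection loss.
     static boolean shouldRetry(int retryCount, int numChildren) {
       // First 3 losses are treated as transient (network, session, etc.);
       // beyond that, only retry while the child count is under the limit.
       return retryCount < 3 || numChildren < NUM_CHILDREN_LIMIT;
     }
   }
   ```
   
   Unlike resetting the counter after one successful stat check, this keeps re-evaluating the limit on every later retry.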

##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,50 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+
     try {
      List<String> children = retryUntilConnected(new Callable<List<String>>() {
+        private int connectionLossRetryCount = 0;
+
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount;
+            // Allow retrying 3 times before checking stat checking number of children,
+            // because there is a higher possibility that connection loss is caused by other
+            // factors such as network connectivity, connected ZK node could not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of children.
+              // Infinitely retrying connecting may cause high GC in ZK server and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to exit retry loop.
+              // TODO: remove this check once we have a better way to exit infinite retry
+              Stat stat = getStat(path);
+              if (stat != null) {
+                if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+                  LOG.error("Failed to get children for path {} because number of children {} "

Review comment:
       Let's make the log message clear:
   "Failed to get children for path {} because of connection loss. Stop retrying because the number of children exceeds ......"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


