[GitHub] [helix] pkuwm commented on a change in pull request #1109: ZkClient should not keep retrying getChildren() due to large number of children

GitBox Mon, 22 Jun 2020 18:20:04 -0700


pkuwm commented on a change in pull request #1109:
URL: https://github.com/apache/helix/pull/1109#discussion_r443906147




##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -103,6 +109,13 @@
   // ZkEventThread. Otherwise the retry request might block the normal event 
processing.
   protected final ZkAsyncRetryThread _asyncCallRetryThread;
 
+  static {

Review comment:
       I don't want to change the original init way for 
MAX_RECONNECT_INTERVAL_MS.
   Like I comment, in terms of NUM_CHILDREN_LIMIT, the purpose is for unit 
test. I explain it in comment: Set it here for unit test to use reflection to 
change value because compilers optimize constants by replacing them inline. If 
we don't do it like this, we would not be able to unit test getChildren() 
unless we create > 100K children.

##########
File path: 
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,51 @@ private void fireAllEvents() {
 
   protected List<String> getChildren(final String path, final boolean watch) {
     long startT = System.currentTimeMillis();
+
     try {
       List<String> children = retryUntilConnected(new Callable<List<String>>() 
{
+        private int connectionLossRetryCount = 0;
+
         @Override
         public List<String> call() throws Exception {
-          return getConnection().getChildren(path, watch);
+          try {
+            return getConnection().getChildren(path, watch);
+          } catch (ConnectionLossException e) {
+            ++connectionLossRetryCount;
+            // Allow retrying 3 times before checking stat checking number of 
children,
+            // because there is a higher possibility that connection loss is 
caused by other
+            // factors such as network connectivity, connected ZK node could 
not serve
+            // the request, session expired, etc.
+            if (connectionLossRetryCount >= 3) {
+              // Issue: https://github.com/apache/helix/issues/962
+              // Connection loss might be caused by an excessive number of 
children.
+              // Infinitely retrying connecting may cause high GC in ZK server 
and kill ZK server.
+              // This is a workaround to check numChildren to have a chance to 
exit retry loop.
+              // TODO: remove this check once we have a better way to exit 
infinite retry
+              Stat stat = getStat(path);
+              if (stat != null) {
+                if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+                  LOG.error("Failed to get children for path {} because number 
of children {} "
+                          + "exceeds limit {}, aborting retry.", path, 
stat.getNumChildren(),
+                      NUM_CHILDREN_LIMIT);
+                  // There is not an accurate KeeperException for the purpose.
+                  // MarshallingErrorException could represent transport error,
+                  // so use it to exit retry loop and tell that zk is not able 
to
+                  // transport the data because packet length is too large.
+                  throw new KeeperException.MarshallingErrorException();

Review comment:
       I considered both. `BadArgumentsException` was the one I tried to use, 
but realized that `MarshallingErrorException` could represent a transport error 
and mostly mean data exceeds jute.maxbuffer. Here is an example: 
https://tgockel.github.io/zookeeper-cpp/classzk_1_1marshalling__error.html#details
   `An error occurred while marshalling data. The most common cause of this is 
exceeding the Jute buffer size – meaning the transaction was too large`
   So I decided to choose MarshallingError.  BadArgumentsException for me 
sounds more like the API is misused, eg, invalid path that has extra slash 
"/path/".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [helix] pkuwm commented on a change in pull request #1109: ZkClient should not keep retrying getChildren() due to large number of children

Reply via email to