kaisun2000 commented on a change in pull request #1109:
URL: https://github.com/apache/helix/pull/1109#discussion_r445226103
##########
File path:
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,51 @@ private void fireAllEvents() {
protected List<String> getChildren(final String path, final boolean watch) {
long startT = System.currentTimeMillis();
+
try {
List<String> children = retryUntilConnected(new Callable<List<String>>() {
+ private int connectionLossRetryCount = 0;
+
@Override
public List<String> call() throws Exception {
- return getConnection().getChildren(path, watch);
+ try {
+ return getConnection().getChildren(path, watch);
+ } catch (ConnectionLossException e) {
+ ++connectionLossRetryCount;
+ // Allow retrying 3 times before checking stat checking number of children,
+ // because there is a higher possibility that connection loss is caused by other
+ // factors such as network connectivity, connected ZK node could not serve
+ // the request, session expired, etc.
+ if (connectionLossRetryCount >= 3) {
+ // Issue: https://github.com/apache/helix/issues/962
+ // Connection loss might be caused by an excessive number of children.
+ // Infinitely retrying connecting may cause high GC in ZK server and kill ZK server.
+ // This is a workaround to check numChildren to have a chance to exit retry loop.
+ // TODO: remove this check once we have a better way to exit infinite retry
+ Stat stat = getStat(path);
+ if (stat != null) {
+ if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+ LOG.error("Failed to get children for path {} because number of children {} "
+ + "exceeds limit {}, aborting retry.", path, stat.getNumChildren(),
+ NUM_CHILDREN_LIMIT);
+ // There is not an accurate KeeperException for the purpose.
+ // MarshallingErrorException could represent transport error,
+ // so use it to exit retry loop and tell that zk is not able to
+ // transport the data because packet length is too large.
+ throw new KeeperException.MarshallingErrorException();
Review comment:
As discussed offline, here is the basic principle inherited from the I0Itec
ZkClient: inside this layer we see KeeperException, while the user sees
ZkException. retryUntilConnected throws ZkException, so that is what the user
sees.
Here is an idea:
Maybe we can add another type of ZkException, say TooManyChildrenZkException,
and have retryUntilConnected convert the KeeperException into this new
ZkException. Users of ZkClient can then handle TooManyChildrenZkException with
whatever logic they want.
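A minimal sketch of that idea, not the actual Helix change. Every class here is a hypothetical stand-in so the example is self-contained: `ZkException` mimics the user-facing exception, `LowLevelMarshallingException` stands in for the KeeperException thrown inside the layer, and `callAndTranslate` plays the role of the translation step in retryUntilConnected.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for the user-facing exception type; the real ZkClient
// has its own ZkException, this one only keeps the sketch self-contained.
class ZkException extends RuntimeException {
  ZkException(String message, Throwable cause) {
    super(message, cause);
  }
}

// Hypothetical new exception type, named as proposed in the review comment.
class TooManyChildrenZkException extends ZkException {
  private final String path;

  TooManyChildrenZkException(String path, Throwable cause) {
    super("Too many children under " + path, cause);
    this.path = path;
  }

  String getPath() {
    return path;
  }
}

// Hypothetical stand-in for the low-level, checked KeeperException subtype.
class LowLevelMarshallingException extends Exception {
}

class ExceptionTranslationSketch {
  // Runs the operation and converts low-level exceptions into user-facing ones,
  // so callers only ever see ZkException or a subclass of it.
  static <T> T callAndTranslate(Callable<T> op, String path) {
    try {
      return op.call();
    } catch (LowLevelMarshallingException e) {
      // The specific low-level failure maps to the specific user-facing type.
      throw new TooManyChildrenZkException(path, e);
    } catch (Exception e) {
      // Everything else is wrapped in the generic user-facing type.
      throw new ZkException("Operation failed for " + path, e);
    }
  }
}
```

With this shape, a caller could catch `TooManyChildrenZkException` separately from other `ZkException`s and decide whether to page the children, alert, or give up.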
##########
File path:
zookeeper-api/src/test/java/org/apache/helix/zookeeper/impl/client/TestRawZkClient.java
##########
@@ -799,4 +802,66 @@ public void testAsyncWriteOperations() {
zkClient.delete("/tmp/asyncOversize");
}
}
+
+ /*
+ * Tests getChildren() when there are an excessive number of children and connection loss happens,
+ * the operation should terminate and exit retry loop.
+ */
+ @Test
+ public void testGetChildrenOnLargeNumChildren() throws Exception {
+ // Default packetLen is 4M. It is static final and initialized
Review comment:
This test relies on an implementation detail of ZkClient. Can we instead
simulate the real scenario by creating, say, 10K children with 100-byte names?
A test like that would hold up longer as the implementation changes.
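When sizing such a test, a rough estimate of the getChildren response payload may help. The sketch below assumes the jute binary encoding of a string list (a 4-byte element count, then a 4-byte length prefix plus the UTF-8 bytes per name), ignores the reply header, and takes the 4M packetLen from the test's own comment. The class and method names are hypothetical.

```java
class ChildrenPayloadEstimate {
  // Approximate serialized size of a list of child names:
  // 4 bytes for the element count, then per name a 4-byte length prefix
  // followed by the name bytes. The reply header is ignored.
  static long estimateBytes(int numChildren, int nameLengthBytes) {
    return 4L + (long) numChildren * (4L + nameLengthBytes);
  }

  public static void main(String[] args) {
    long packetLenLimit = 4L * 1024 * 1024; // 4M, per the comment in the test
    long size = estimateBytes(10_000, 100);
    System.out.println("estimated bytes: " + size);
    System.out.println("exceeds 4M limit: " + (size > packetLenLimit));
  }
}
```

Under these assumptions, 10K children with 100-byte names come to roughly 1 MB, so a realistic test might need more children or longer names to exceed a 4M limit.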