pkuwm commented on a change in pull request #1109:
URL: https://github.com/apache/helix/pull/1109#discussion_r454878193
##########
File path:
zookeeper-api/src/main/java/org/apache/helix/zookeeper/zkclient/ZkClient.java
##########
@@ -984,11 +997,50 @@ private void fireAllEvents() {
protected List<String> getChildren(final String path, final boolean watch) {
long startT = System.currentTimeMillis();
+
try {
List<String> children = retryUntilConnected(new Callable<List<String>>()
{
+ private int connectionLossRetryCount = 0;
+
@Override
public List<String> call() throws Exception {
- return getConnection().getChildren(path, watch);
+ try {
+ return getConnection().getChildren(path, watch);
+ } catch (ConnectionLossException e) {
+ ++connectionLossRetryCount;
+ // Allow retrying 3 times before checking stat checking number of
children,
+ // because there is a higher possibility that connection loss is
caused by other
+ // factors such as network connectivity, connected ZK node could
not serve
+ // the request, session expired, etc.
+ if (connectionLossRetryCount >= 3) {
+ // Issue: https://github.com/apache/helix/issues/962
+ // Connection loss might be caused by an excessive number of
children.
+ // Infinitely retrying connecting may cause high GC in ZK server
and kill ZK server.
+ // This is a workaround to check numChildren to have a chance to
exit retry loop.
+ // TODO: remove this check once we have a better way to exit
infinite retry
+ Stat stat = getStat(path);
+ if (stat != null) {
+ if (stat.getNumChildren() > NUM_CHILDREN_LIMIT) {
+ LOG.error("Failed to get children for path {} because number
of children {} "
+ + "exceeds limit {}, aborting retry.", path,
stat.getNumChildren(),
+ NUM_CHILDREN_LIMIT);
+ // MarshallingErrorException could represent transport
error: exceeding the
+ // Jute buffer size. So use it to exit retry loop and tell
that zk is not able to
+ // transport the data because packet length is too large.
+ throw new KeeperException.MarshallingErrorException();
+ } else {
+ // No need to do stat again for next connection loss.
+ connectionLossRetryCount = 0;
Review comment:
> I think you might want to keep checking the children's number on any
retries after the first 3. Or, what if the children are increased during the
retry, right?
Yes, it will still check numChildren after the first 3 retries. The strategy
is check numChildren every other 3 retries.
Your code checks numChildren every time after the first 3 retries, right?
But actually I don't think it is necessary to check every time after 3 retries.
I would like to just check numChildren every other 3 retries. So maybe your
code could not really simplify the logic. I'll sync up with you.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]