[ https://issues.apache.org/jira/browse/HELIX-748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564149#comment-16564149 ]
Jiajun Wang commented on HELIX-748:
-----------------------------------

Good point. We should certainly do that.

Besides this concern, we need to resolve another issue as well.
Note that the proposed code keeps retrying on any Exception other than ZkException or InterruptedException. This could be dangerous: if any callback throws an arbitrary Exception from its business logic, the client call will keep retrying forever.

So, 2 options:
# Check all possible Exceptions thrown by the Zk operation call and make sure only KeeperExceptions are thrown, so that we know when to retry and when to stop.
# Change the ZkConnection processing logic to ensure the reference is never null. In this case, any exception must be related to business logic, and we can safely end the retry. To implement this, we can add atomic connection swap logic so that the ZkConnection ref is always valid.

Based on our investigation, option 2 seems to be the cleaner design. ZkConnection is used everywhere, so any possibility of this ref being null means more error handling work.

> ZkClient should not throw Exception when internal ZkConnection is reset
> -----------------------------------------------------------------------
>
>                 Key: HELIX-748
>                 URL: https://issues.apache.org/jira/browse/HELIX-748
>             Project: Apache Helix
>          Issue Type: Task
>            Reporter: Jiajun Wang
>            Assignee: Jiajun Wang
>            Priority: Major
>
> It is noticed that ZkClient throws an exception because ZkConnection == null while it is being reset.
> This could be caused by the handling of an expiring session. According to the design, a ZkClient operation should wait until the reset is done instead of breaking out of the retry.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
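
For illustration only, option 2 from the comment above (an atomic connection swap so the reference is never null) might look roughly like the sketch below. This is not the Helix implementation. `Conn`, `SwappableConn`, `ConnectionLostException`, and `readWithRetry` are hypothetical names invented for this example, and the bounded `maxAttempts` loop is a simplifying assumption standing in for whatever retry timeout the real client uses.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

final class ConnectionSwapSketch {

    /** Hypothetical stand-in for a ZooKeeper connection (not the real ZkConnection API). */
    interface Conn {
        String read(String path) throws Exception;
    }

    /** Hypothetical marker for "connection lost, worth retrying". */
    static final class ConnectionLostException extends Exception {
        ConnectionLostException(String message) {
            super(message);
        }
    }

    /** Holds a connection reference that is never null, even during a reset. */
    static final class SwappableConn {
        private final AtomicReference<Conn> ref;

        SwappableConn(Conn initial) {
            this.ref = new AtomicReference<>(Objects.requireNonNull(initial));
        }

        /** Callers always see a fully built connection: either the old one or the new one. */
        Conn get() {
            return ref.get();
        }

        /** Publishes an already-built replacement in one atomic step; returns the old connection. */
        Conn swap(Conn replacement) {
            return ref.getAndSet(Objects.requireNonNull(replacement));
        }
    }

    /** Retries only on connection loss; any other exception is treated as business logic and propagates. */
    static String readWithRetry(SwappableConn holder, String path, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConnectionLostException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                // Re-read the holder on every attempt so a swapped-in connection is picked up.
                return holder.get().read(path);
            } catch (ConnectionLostException e) {
                last = e; // connection problem: try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Start with a connection whose session has expired.
        SwappableConn holder = new SwappableConn(path -> {
            throw new ConnectionLostException("session expired");
        });
        // The reset builds the new connection first, then swaps it in atomically.
        holder.swap(path -> "data@" + path);
        System.out.println(readWithRetry(holder, "/helix/example", 3));
    }
}
```

Because the replacement is constructed before it is published, no caller can observe a null reference, so the retry loop does not need to treat unknown exceptions as a possible reset in progress.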