This email proposes bumping the version of the ZkClient library used by Helix.

1. We have noticed that ZK calls sometimes hang for unknown reasons. HBase users appear to run into this kind of issue often, and various fixes have been incorporated into ZK in versions 3.4+. ZkClient 0.5 is based on an older 3.4 release. A jstack log reported by one of our open source users, Gobblin, is attached at the end of this email.

2. We have already upgraded the ZK server to 3.4.13. The corresponding ZkClient version is 0.11 (see https://github.com/sgroschupf/zkclient/blob/master/CHANGELOG.markdown). Other heavy users of ZooKeeper, such as Kafka, have already upgraded to 0.11.

3. We will first test it on LinkedIn's testing clusters to make sure there are no obvious signs of regression.

Overall, the goal is to further stabilize ZK-related operations in Helix. Please take a look at the CHANGELOG linked above for more details on what changed across ZkClient versions.

Let me know what you think,
Hunter

--------------

"FetchJobSpecExecutor" #88 prio=5 os_prio=0 tid=0x00007f8f8ab2c800 nid=0x25e9 in Object.wait() [0x00007f8f5c13b000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
    - locked <0x000000076af719c0> (a org.apache.zookeeper.ClientCnxn$Packet)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1470)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
    at org.apache.helix.manager.zk.zookeeper.ZkConnection.getChildren(ZkConnection.java:127)
    at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:698)
    at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:695)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1102)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:695)
    at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:689)
    at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildNames(ZkBaseDataAccessor.java:507)
    at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:463)
    at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:431)
    at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValues(ZKHelixDataAccessor.java:409)
    at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:468)
    at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:459)
    at org.apache.helix.task.TaskDriver.getWorkflows(TaskDriver.java:847)
    at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:287)
    at org.apache.gobblin.cluster.GobblinHelixJobScheduler.cancelJobIfRequired(GobblinHelixJobScheduler.java:363)
    at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleDeleteJobConfigArrival(GobblinHelixJobScheduler.java:352)
    at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleUpdateJobConfigArrival(GobblinHelixJobScheduler.java:322)