Upon discussion, the direction should actually be to move away from using I0Itec's ZkClient, for the following reasons:

1. ZK version dependency
2. Helix's own ZkClient contains a lot of custom logic and differs from I0Itec's ZkClient.

We will proceed in this direction.

Hunter

On Thu, Jun 27, 2019 at 10:28 AM Hunter Lee <hu...@apache.org> wrote:

> This email is to suggest bumping the version of the ZkClient library used
> by Helix.
>
> 1. We have noticed that ZK calls sometimes hang for unknown reasons.
>    This kind of issue seems to be commonly experienced by HBase users,
>    but various fixes have been incorporated into ZK in versions 3.4+.
>    The client version 0.5 is based on an older version of 3.4. I will
>    attach a jstack log reported by one of our open source users, Gobblin,
>    at the end of this email.
> 2. We have already upgraded the ZK server to 3.4.13. The corresponding
>    version of the ZkClient library is 0.11 (see
>    https://github.com/sgroschupf/zkclient/blob/master/CHANGELOG.markdown).
>    Other heavy users of ZooKeeper, such as Kafka, have already upgraded
>    to 0.11.
> 3. We will first proceed by testing it on LinkedIn's testing clusters
>    to make sure there are no obvious signs of regression.
>
> Overall, the goal is to further stabilize ZK-related operations in Helix.
> Please take a look at the CHANGELOG linked above for more details on what
> changed across ZkClient versions.
>
> Let me know what you think,
> Hunter
>
> --------------
>
> "FetchJobSpecExecutor" #88 prio=5 os_prio=0 tid=0x00007f8f8ab2c800 nid=0x25e9 in Object.wait() [0x00007f8f5c13b000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
>         - locked <0x000000076af719c0> (a org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1470)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.helix.manager.zk.zookeeper.ZkConnection.getChildren(ZkConnection.java:127)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:698)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:695)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1102)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:695)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:689)
>         at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildNames(ZkBaseDataAccessor.java:507)
>         at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:463)
>         at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:431)
>         at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValues(ZKHelixDataAccessor.java:409)
>         at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:468)
>         at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:459)
>         at org.apache.helix.task.TaskDriver.getWorkflows(TaskDriver.java:847)
>         at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:287)
>         at org.apache.gobblin.cluster.GobblinHelixJobScheduler.cancelJobIfRequired(GobblinHelixJobScheduler.java:363)
>         at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleDeleteJobConfigArrival(GobblinHelixJobScheduler.java:352)
>         at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleUpdateJobConfigArrival(GobblinHelixJobScheduler.java:322)
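
The trace above shows a caller thread parked in Object.wait() underneath a synchronous getChildren call, with nothing bounding how long it waits. As an illustration only, here is a minimal sketch of how a caller could put an upper bound on such a blocking call using plain java.util.concurrent. The class BoundedCall and the method callWithTimeout are hypothetical names invented for this sketch; they are not part of Helix, ZkClient, or ZooKeeper, and the hanging call is simulated with Thread.sleep rather than a real ZooKeeper request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCall {

    // Hypothetical helper (not a Helix or ZooKeeper API): runs a blocking call
    // on a separate daemon thread and gives up after timeoutMs milliseconds.
    public static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        try {
            Future<T> future = executor.submit(call);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupt the worker; a thread blocked in Object.wait() or
                // Thread.sleep() reacts with InterruptedException.
                future.cancel(true);
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A call that returns promptly passes its result through.
        System.out.println(callWithTimeout(() -> "children", 1000));

        // A simulated hang (stand-in for a blocked read) is cut off.
        try {
            callWithTimeout(() -> {
                Thread.sleep(60_000);
                return "never";
            }, 200);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

This only unblocks the caller; it does not address why the underlying request never completes, which is what the library discussion in this thread is about.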