This email proposes bumping the version of the ZkClient library used by
Helix.

   1. We have noticed that ZK calls sometimes hang for unknown reasons.
   HBase users seem to run into this kind of issue often, but various
   fixes have been incorporated into ZK in versions 3.4+. ZkClient
   version 0.5 is based on an older 3.4 release. I have attached a jstack log
   reported by one of our open source users, Gobblin, at the end of this email.
   2. We have already upgraded ZK server to 3.4.13. The corresponding
   version for the ZkClient library is 0.11 (See
   https://github.com/sgroschupf/zkclient/blob/master/CHANGELOG.markdown).
   Other heavy users of ZooKeeper such as Kafka have already upgraded to 0.11.
   3. We will start by testing it on LinkedIn's test clusters to
   make sure there are no obvious signs of regression.

Overall, the goal is to further stabilize ZK-related operations in Helix.
Please take a look at the CHANGELOG linked above for more details on what
changed across ZkClient versions.

Let me know what you think,
Hunter

--------------

"FetchJobSpecExecutor" #88 prio=5 os_prio=0 tid=0x00007f8f8ab2c800 nid=0x25e9 in Object.wait() [0x00007f8f5c13b000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
        - locked <0x000000076af719c0> (a org.apache.zookeeper.ClientCnxn$Packet)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1470)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
        at org.apache.helix.manager.zk.zookeeper.ZkConnection.getChildren(ZkConnection.java:127)
        at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:698)
        at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:695)
        at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1102)
        at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:695)
        at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:689)
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildNames(ZkBaseDataAccessor.java:507)
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:463)
        at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:431)
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValues(ZKHelixDataAccessor.java:409)
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:468)
        at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:459)
        at org.apache.helix.task.TaskDriver.getWorkflows(TaskDriver.java:847)
        at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:287)
        at org.apache.gobblin.cluster.GobblinHelixJobScheduler.cancelJobIfRequired(GobblinHelixJobScheduler.java:363)
        at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleDeleteJobConfigArrival(GobblinHelixJobScheduler.java:352)
        at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleUpdateJobConfigArrival(GobblinHelixJobScheduler.java:322)
