After discussion, the direction should instead be to move away from using
I0Itec's ZkClient, for the following reasons:
1. ZK version dependency
2. Helix's own ZkClient contains a lot of custom logic that differs from
I0Itec's ZkClient.

We will proceed in this direction.

Hunter

On Thu, Jun 27, 2019 at 10:28 AM Hunter Lee <hu...@apache.org> wrote:

> This email proposes bumping the version of the ZkClient library used by
> Helix.
>
>    1. We have noticed that sometimes ZK calls hang due to unknown
>    reasons. This kind of issue seems to be commonly experienced by HBase
>    users, but various fixes have been incorporated into ZK in versions 3.4+. The
>    client version 0.5 is based on an older version of 3.4. I will attach a
>    jstack log reported by one of our open source users, Gobblin, at the end of
>    this email.
>    2. We have already upgraded ZK server to 3.4.13. The corresponding
>    version for the ZkClient library is 0.11 (See
>    https://github.com/sgroschupf/zkclient/blob/master/CHANGELOG.markdown).
>    Other heavy users of ZooKeeper such as Kafka have already upgraded to 0.11.
>    3. We will first test it on LinkedIn's test clusters to make sure
>    there are no obvious signs of regression.
>
> Overall, the goal is to further stabilize ZK-related operations in Helix.
> Please take a look at the CHANGELOG linked above for more details on what
> changed across ZkClient versions.
>
> Let me know what you think,
> Hunter
>
> --------------
>
> "FetchJobSpecExecutor" #88 prio=5 os_prio=0 tid=0x00007f8f8ab2c800
> nid=0x25e9 in Object.wait() [0x00007f8f5c13b000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
>         - locked <0x000000076af719c0> (a org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1470)
>         at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
>         at org.apache.helix.manager.zk.zookeeper.ZkConnection.getChildren(ZkConnection.java:127)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:698)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient$2.call(ZkClient.java:695)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient.retryUntilConnected(ZkClient.java:1102)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:695)
>         at org.apache.helix.manager.zk.zookeeper.ZkClient.getChildren(ZkClient.java:689)
>         at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildNames(ZkBaseDataAccessor.java:507)
>         at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:463)
>         at org.apache.helix.manager.zk.ZkBaseDataAccessor.getChildren(ZkBaseDataAccessor.java:431)
>         at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValues(ZKHelixDataAccessor.java:409)
>         at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:468)
>         at org.apache.helix.manager.zk.ZKHelixDataAccessor.getChildValuesMap(ZKHelixDataAccessor.java:459)
>         at org.apache.helix.task.TaskDriver.getWorkflows(TaskDriver.java:847)
>         at org.apache.gobblin.cluster.HelixUtils.getWorkflowIdsFromJobNames(HelixUtils.java:287)
>         at org.apache.gobblin.cluster.GobblinHelixJobScheduler.cancelJobIfRequired(GobblinHelixJobScheduler.java:363)
>         at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleDeleteJobConfigArrival(GobblinHelixJobScheduler.java:352)
>         at org.apache.gobblin.cluster.GobblinHelixJobScheduler.handleUpdateJobConfigArrival(GobblinHelixJobScheduler.java:322)
>
