vishalsuvagia opened a new issue, #3071: URL: https://github.com/apache/helix/issues/3071
### Describe the bug

Apache Ambari Metrics uses Helix for cluster management tasks. We recently tried to upgrade the Helix dependency from 0.6.6 to 1.3.2 / 1.4.3; however, with the newer version of Helix the Metrics Collector fails to start when the Hadoop cluster is deployed in Kerberos-enabled mode.

Based on my investigation, the issue appears to come from a change in the Helix core ZooKeeper initialisation: it fails to create the ZooKeeper client, and a service shutdown is triggered with the error below in the trace.

> 2025-09-17 10:54:29,633 WARN org.apache.helix.manager.zk.ZKHelixManager: **zkClient to testnode01.mycluster.org:2181 is not connected**, wait for 10000ms.
> 2025-09-17 10:54:39,635 ERROR org.apache.helix.manager.zk.ZKHelixManager: **zkClient is not connected after waiting 10000ms**.,
> clusterName: ambari-metrics-cluster, zkAddress: testnode01.mycluster.org:2181
> **ERROR org.apache.helix.manager.zk.ZKHelixManager: fail to createClient. retry 1**
> org.apache.helix.HelixException: HelixManager is not connected within retry timeout for cluster ambari-metrics-cluster
> at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:417)
> at org.apache.helix.manager.zk.ZKHelixManager.getConfigAccessor(ZKHelixManager.java:687)
> at org.apache.helix.manager.zk.ParticipantManager.<init>(ParticipantManager.java:118)
> at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1440)
> at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1390)
> at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:782)
> at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:817)
> at org.apache.ambari.metrics.core.timeline.availability.AggregationTaskRunner.initialize(AggregationTaskRunner.java:135)
> at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.startAggregators(MetricCollectorHAController.java:205)
> at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.initializeHAController(MetricCollectorHAController.java:184)
> at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.initializeSubsystem(HBaseTimelineMetricsService.java:133)
> at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.serviceInit(HBaseTimelineMetricsService.java:102)

I am trying to understand the change in behaviour on the library side and to find an appropriate fix. So far I have tried a few approaches: setting the ZooKeeper timeout with system properties and `-D` arguments, setting the helix.zk [session and connection](https://github.com/apache/helix/blob/master/helix-common/src/main/java/org/apache/helix/SystemPropertyKeys.java#L51-L53) timeouts, and rewriting the `ZKHelixManager` initialisation to pass in `RealmAwareZkClient`, `RealmAwareZkClientConfig`, `CloudConfig` and `HelixManagerProperty` instances built with the required parameters. None of these has worked.

Please help and guide us towards an appropriate fix for the issue.

For reference, the Apache Ambari Metrics Helix upgrade: https://github.com/apache/ambari-metrics/pull/173

cc: @jackjlli / @Jackie-Jiang

### To Reproduce

Steps to reproduce the behavior.

### Expected behavior

A clear and concise description of what you expected to happen.

### Additional context

Add any other context about the problem here.
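To make the timeout attempt described above concrete, here is a minimal sketch of how the timeouts could be set programmatically before the Helix manager is created. The class name `HelixZkTimeoutProps` and the chosen values are hypothetical. The three property names are written from memory of `org.apache.helix.SystemPropertyKeys` and are an assumption that should be checked against the Helix version in use. The sketch also assumes Helix reads these values when the manager is constructed, so they would have to be set before that point.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HelixZkTimeoutProps {
    // ASSUMPTION: key names recalled from org.apache.helix.SystemPropertyKeys;
    // verify them against the Helix release you actually depend on.
    static final String ZK_SESSION_TIMEOUT = "zk.session.timeout";
    static final String ZK_CONNECTION_TIMEOUT = "zk.connection.timeout";
    static final String ZK_WAIT_CONNECTED_TIMEOUT = "helixmanager.waitForConnectedTimeout";

    /**
     * Sets the three timeouts (in milliseconds) as JVM system properties and
     * returns what was applied, in insertion order, for logging.
     */
    static Map<String, String> apply(int sessionMs, int connectionMs, int waitConnectedMs) {
        Map<String, String> applied = new LinkedHashMap<>();
        applied.put(ZK_SESSION_TIMEOUT, Integer.toString(sessionMs));
        applied.put(ZK_CONNECTION_TIMEOUT, Integer.toString(connectionMs));
        applied.put(ZK_WAIT_CONNECTED_TIMEOUT, Integer.toString(waitConnectedMs));
        applied.forEach(System::setProperty);
        return applied;
    }

    public static void main(String[] args) {
        // Hypothetical values; print the equivalent -D arguments for the JVM command line.
        apply(60_000, 120_000, 60_000)
            .forEach((key, value) -> System.out.println("-D" + key + "=" + value));
    }
}
```

Calling `HelixZkTimeoutProps.apply(...)` early in service start-up would be equivalent to passing the same keys as `-D` arguments. As noted above, raising these timeouts alone did not resolve the connection failure in our Kerberos-enabled deployment.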
