saurabhd336 opened a new issue, #2685: URL: https://github.com/apache/helix/issues/2685
Running into an issue with https://github.com/apache/pinot using helix version 1.3.1, wherein there seems to be an inconsistency b/w the returned status code of the `ZkBaseDataAccessor.set` vs the expected outcome. The `set` call to update the idealState [here](https://github.com/apache/pinot/blob/master/pinot-common/src/main/java/org/apache/pinot/common/utils/helix/HelixHelper.java#L146), fails with retCode != OK, but the IdealState ZNode does get updated successfully. This is causing unexpected behaviour further downstream in the system, which relies on the return code of the set call. Relevant logs from helix calls ``` 2023/10/15 21:13:08.573 ERROR [ZkBaseDataAccessor] [jersey-server-managed-async-executor-24] Exception while setting path: /pinot/IDEALSTATES/myTable_OFFLINE org.apache.helix.zookeeper.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2002) at org.apache.helix.zookeeper.zkclient.ZkClient.writeDataReturnStat(ZkClient.java:2220) at org.apache.helix.zookeeper.zkclient.ZkClient.writeDataGetStat(ZkClient.java:2236) at org.apache.helix.manager.zk.ZkBaseDataAccessor.doSet(ZkBaseDataAccessor.java:365) at org.apache.helix.manager.zk.ZkBaseDataAccessor.set(ZkBaseDataAccessor.java:344) at org.apache.pinot.common.utils.helix.HelixHelper$1.call(HelixHelper.java:146) at org.apache.pinot.common.utils.helix.HelixHelper$1.call(HelixHelper.java:104) at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50) at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:104) at org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:227) at org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:2265) at org.apache.pinot.controller.api.upload.ZKOperator.processNewSegment(ZKOperator.java:333) at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:93) at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:387) at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegmentAsMultiPart(PinotSegmentUploadDownloadRestletResource.java:550) at jdk.internal.reflect.GeneratedMethodAccessor402.invoke(Unknown Source) ~[?:?] at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?] at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?] at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177) at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159) at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81) at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) at org.glassfish.jersey.server.model.ResourceMethodInvoker.lambda$apply$0(ResourceMethodInvoker.java:390) at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2$1.run(ServerRuntime.java:830) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) at org.glassfish.jersey.internal.Errors.process(Errors.java:292) at org.glassfish.jersey.internal.Errors.process(Errors.java:274) at org.glassfish.jersey.internal.Errors.process(Errors.java:244) at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2.run(ServerRuntime.java:825) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] Caused by: java.lang.InterruptedException at java.lang.Object.wait(Native Method) ~[?:?] at java.lang.Object.wait(Object.java:328) ~[?:?] at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1594) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1566) at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2559) at org.apache.helix.zookeeper.zkclient.ZkConnection.writeDataReturnStat(ZkConnection.java:211) at org.apache.helix.zookeeper.zkclient.ZkClient.lambda$writeDataReturnStat$9(ZkClient.java:2221) at org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986) ... 37 more ``` ``` Worth noting, the helixZK manager may have been shutting down at around the same time. This could explain why the ClientCnxn.java's wait() got interrupted, either way the expectation is for the `ZkBaseDataAccessor.set`'s status being returned to reflect the correct status of the operation. ``` atch leader session 2000b1610f82230 2023/10/15 21:12:57.666 WARN [ZKHelixManager] [prometheus-http-1-5] Instance pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000 is not leader of cluster pinot due to current session 201154a5bc81aaa does not match leader session 2000b1610f82230 2023/10/15 21:13:07.941 INFO [ZKHelixManager] [Thread-57] disconnect Controller_pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000(PARTICIPANT) from pinot 2023/10/15 21:13:08.006 INFO [ParticipantHealthReportTask] [Thread-57] Stop ParticipantHealthReportTimerTask 2023/10/15 21:13:08.006 INFO [HelixTaskExecutor] [Thread-57] Shutting down HelixTaskExecutor 2023/10/15 21:13:08.006 INFO [HelixTaskExecutor] [Thread-57] Reset HelixTaskExecutor 2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean ClusterStatus:cluster=pinot,messageQueue=Controller_pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000 has been un-registered. 2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=USER_DEFINE_MSG has been un-registered. 2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: USER_DEFINE_MSG, pool: java.util.concurrent.ThreadPoolExecutor@3a4d0ca2[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@3a4d0ca2[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=TASK_REPLY has been un-registered. 2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: TASK_REPLY, pool: java.util.concurrent.ThreadPoolExecutor@45fa7fb7[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@45fa7fb7[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=STATE_TRANSITION has been un-registered. 2023/10/15 21:13:08.009 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: STATE_TRANSITION, pool: java.util.concurrent.ThreadPoolExecutor@3c83e9ca[Running, pool size = 16, active threads = 0, queued tasks = 0, completed tasks = 16] 2023/10/15 21:13:08.009 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@3c83e9ca[Running, pool size = 16, active threads = 0, queued tasks = 0, completed tasks = 16] 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Resetting HelixStateMachineEngine 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_15 2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_23 2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_12 2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_1 2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_20 2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_18 2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_8 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_5 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Successfully reset HelixStateMachineEngine 2023/10/15 21:13:08.014 INFO [MBeanRegistrar] [Thread-57] MBean HelixThreadPoolExecutor:Type=STATE_TRANSITION_CANCELLATION has been un-registered. 2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57] Reset executor for msgType: STATE_TRANSITION_CANCELLATION, pool: java.util.concurrent.ThreadPoolExecutor@6c427386[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@6c427386[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Resetting HelixStateMachineEngine 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_15 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_23 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_12 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_1 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_20 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_18 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_8 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating on leadControllerResource::leadControllerResource_5 2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method invoked. Either because the process longer own this resource or session timedout 2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Successfully reset HelixStateMachineEngine 2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57] ``` Any help in helping identifying the root cause / remedial steps will be appreciated thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
