saurabhd336 opened a new issue, #2685:
URL: https://github.com/apache/helix/issues/2685

   Running into an issue with https://github.com/apache/pinot using helix 
version 1.3.1, wherein there seems to be an inconsistency b/w the returned 
status code of the `ZkBaseDataAccessor.set` vs the expected outcome. The `set` 
call to update the idealState 
[here](https://github.com/apache/pinot/blob/master/pinot-common/src/main/java/org/apache/pinot/common/utils/helix/HelixHelper.java#L146),
 fails with retCode != OK, but the IdealState ZNode does get updated 
successfully. This is causing unexpected behaviour further downstream in the 
system, which relies on the return code of the set call. 
   
   Relevant logs from helix calls
   ```
   2023/10/15 21:13:08.573 ERROR [ZkBaseDataAccessor] 
[jersey-server-managed-async-executor-24] Exception while setting path: 
/pinot/IDEALSTATES/myTable_OFFLINE
   org.apache.helix.zookeeper.zkclient.exception.ZkInterruptedException: 
java.lang.InterruptedException
           at 
org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:2002)
 
           at 
org.apache.helix.zookeeper.zkclient.ZkClient.writeDataReturnStat(ZkClient.java:2220)
 
           at 
org.apache.helix.zookeeper.zkclient.ZkClient.writeDataGetStat(ZkClient.java:2236)
 
           at 
org.apache.helix.manager.zk.ZkBaseDataAccessor.doSet(ZkBaseDataAccessor.java:365)
 
           at 
org.apache.helix.manager.zk.ZkBaseDataAccessor.set(ZkBaseDataAccessor.java:344) 
           at 
org.apache.pinot.common.utils.helix.HelixHelper$1.call(HelixHelper.java:146) 
           at 
org.apache.pinot.common.utils.helix.HelixHelper$1.call(HelixHelper.java:104) 
           at 
org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50)
 
           at 
org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:104)
 
           at 
org.apache.pinot.common.utils.helix.HelixHelper.updateIdealState(HelixHelper.java:227)
 
           at 
org.apache.pinot.controller.helix.core.PinotHelixResourceManager.assignTableSegment(PinotHelixResourceManager.java:2265)
 
           at 
org.apache.pinot.controller.api.upload.ZKOperator.processNewSegment(ZKOperator.java:333)
 
           at 
org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:93)
 
           at 
org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:387)
 
           at 
org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegmentAsMultiPart(PinotSegmentUploadDownloadRestletResource.java:550)
 
           at jdk.internal.reflect.GeneratedMethodAccessor402.invoke(Unknown 
Source) ~[?:?]
           at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
           at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
           at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
 
           at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
 
           at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
 
           at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159)
 
           at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
 
           at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
 
           at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.lambda$apply$0(ResourceMethodInvoker.java:390)
 
           at 
org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2$1.run(ServerRuntime.java:830)
           at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
           at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
           at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
           at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
           at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
           at 
org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
           at 
org.glassfish.jersey.server.ServerRuntime$AsyncResponder$2.run(ServerRuntime.java:825)
           at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
           at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
           at java.lang.Thread.run(Thread.java:829) [?:?]
   Caused by: java.lang.InterruptedException
           at java.lang.Object.wait(Native Method) ~[?:?]
           at java.lang.Object.wait(Object.java:328) ~[?:?]
           at 
org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1594) 
           at 
org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1566) 
           at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2559) 
           at 
org.apache.helix.zookeeper.zkclient.ZkConnection.writeDataReturnStat(ZkConnection.java:211)
 
           at 
org.apache.helix.zookeeper.zkclient.ZkClient.lambda$writeDataReturnStat$9(ZkClient.java:2221)
 
           at 
org.apache.helix.zookeeper.zkclient.ZkClient.retryUntilConnected(ZkClient.java:1986)
 
           ... 37 more
           ```
   ```
   
   Worth noting, the helixZK manager may have been shutting down at around the 
same time. This could explain why the ClientCnxn.java's wait() got interrupted, 
either way the expectation is for the `ZkBaseDataAccessor.set`'s status being 
returned to reflect the correct status of the operation.
    
   ```
   atch leader session 2000b1610f82230
   2023/10/15 21:12:57.666 WARN [ZKHelixManager] [prometheus-http-1-5] Instance 
pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000
 is not leader of cluster pinot due to current session 201154a5bc81aaa does not 
match leader session 2000b1610f82230
   2023/10/15 21:13:07.941 INFO [ZKHelixManager] [Thread-57] disconnect 
Controller_pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000(PARTICIPANT)
 from pinot
   2023/10/15 21:13:08.006 INFO [ParticipantHealthReportTask] [Thread-57] Stop 
ParticipantHealthReportTimerTask
   2023/10/15 21:13:08.006 INFO [HelixTaskExecutor] [Thread-57] Shutting down 
HelixTaskExecutor
   2023/10/15 21:13:08.006 INFO [HelixTaskExecutor] [Thread-57] Reset 
HelixTaskExecutor
   2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean 
ClusterStatus:cluster=pinot,messageQueue=Controller_pinot-pinot-controller-1.pinot-pinot-controller-headless.managed.svc.cluster.local_9000
 has been un-registered.
   2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean 
HelixThreadPoolExecutor:Type=USER_DEFINE_MSG has been un-registered.
   2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Reset executor 
for msgType: USER_DEFINE_MSG, pool: 
java.util.concurrent.ThreadPoolExecutor@3a4d0ca2[Running, pool size = 0, active 
threads = 0, queued tasks = 0, completed tasks = 0]
   2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Shutting down 
pool: java.util.concurrent.ThreadPoolExecutor@3a4d0ca2[Running, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 0]
   2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean 
HelixThreadPoolExecutor:Type=TASK_REPLY has been un-registered.
   2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Reset executor 
for msgType: TASK_REPLY, pool: 
java.util.concurrent.ThreadPoolExecutor@45fa7fb7[Running, pool size = 0, active 
threads = 0, queued tasks = 0, completed tasks = 0]
   2023/10/15 21:13:08.008 INFO [HelixTaskExecutor] [Thread-57] Shutting down 
pool: java.util.concurrent.ThreadPoolExecutor@45fa7fb7[Running, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 0]
   2023/10/15 21:13:08.008 INFO [MBeanRegistrar] [Thread-57] MBean 
HelixThreadPoolExecutor:Type=STATE_TRANSITION has been un-registered.
   2023/10/15 21:13:08.009 INFO [HelixTaskExecutor] [Thread-57] Reset executor 
for msgType: STATE_TRANSITION, pool: 
java.util.concurrent.ThreadPoolExecutor@3c83e9ca[Running, pool size = 16, 
active threads = 0, queued tasks = 0, completed tasks = 16]
   2023/10/15 21:13:08.009 INFO [HelixTaskExecutor] [Thread-57] Shutting down 
pool: java.util.concurrent.ThreadPoolExecutor@3c83e9ca[Running, pool size = 16, 
active threads = 0, queued tasks = 0, completed tasks = 16]
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Resetting 
HelixStateMachineEngine
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_15
   2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_23
   2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_12
   2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_1
   2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_20
   2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_18
   2023/10/15 21:13:08.013 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.013 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_8
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_5
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] 
Successfully reset HelixStateMachineEngine
   2023/10/15 21:13:08.014 INFO [MBeanRegistrar] [Thread-57] MBean 
HelixThreadPoolExecutor:Type=STATE_TRANSITION_CANCELLATION has been 
un-registered.
   2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57] Reset executor 
for msgType: STATE_TRANSITION_CANCELLATION, pool: 
java.util.concurrent.ThreadPoolExecutor@6c427386[Running, pool size = 0, active 
threads = 0, queued tasks = 0, completed tasks = 0]
   2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57] Shutting down 
pool: java.util.concurrent.ThreadPoolExecutor@6c427386[Running, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 0]
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Resetting 
HelixStateMachineEngine
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_15
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_23
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_12
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_1
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_20
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_18
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_8
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] Operating 
on leadControllerResource::leadControllerResource_5
   2023/10/15 21:13:08.014 WARN [StateModel] [Thread-57] Default reset method 
invoked. Either because the process longer own this resource or session timedout
   2023/10/15 21:13:08.014 INFO [HelixStateMachineEngine] [Thread-57] 
Successfully reset HelixStateMachineEngine
   2023/10/15 21:13:08.014 INFO [HelixTaskExecutor] [Thread-57]
   ```
   
   Any help in helping identifying the root cause / remedial steps will be 
appreciated thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to