Asakiny opened a new issue, #17265: URL: https://github.com/apache/dolphinscheduler/issues/17265
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

I deployed version 3.2.2 on k8s with the helm chart, and dolphinscheduler connects to zookeeper by svc address. When a zookeeper pod restarts and its PodIP changes, dolphinscheduler-api/master/worker can't reconnect to zookeeper; we must restart dolphinscheduler api/master/worker to reconnect.

When zookeeper restarts, the dolphinscheduler logs are:

```
[WARN] 2025-06-17 17:15:49.287 +0800 o.a.z.ClientCnxn:[1292] - Session 0x301d5683db40000 for server dolphischeduler-zookeeper-2.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local/100.97.145.159:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x301d5683db40000, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
[INFO] 2025-06-17 17:15:49.388 +0800 o.a.c.ConnectionState:[284] - Connection string changed to: 100.96.39.151:2181,100.97.26.240:2181,100.97.145.159:2181
[INFO] 2025-06-17 17:15:49.388 +0800 o.a.c.f.s.ConnectionStateManager:[252] - State change: SUSPENDED
[INFO] 2025-06-17 17:15:50.222 +0800 o.a.c.f.s.ConnectionStateManager:[252] - State change: RECONNECTED
[INFO] 2025-06-17 17:15:50.224 +0800 o.a.c.f.i.EnsembleTracker:[201] - New config event received: {server.1=dolphischeduler-zookeeper-0.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, version=0, server.3=dolphischeduler-zookeeper-2.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local:2888:3888:participant;0.0.0.0:2181, server.2=dolphischeduler-zookeeper-1.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
[ERROR] 2025-06-17 17:15:50.227 +0800 o.a.c.f.i.CuratorFrameworkImpl:[733] - Background exception was not retry-able or retry gave up
java.lang.NullPointerException: null
    at org.apache.curator.utils.Compatibility.getHostAddress(Compatibility.java:116)
    at org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185)
    at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206)
    at org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50)
    at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:926)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:683)
    at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
    at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:634)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:553)
100.110.18.239 - - [17/Jun/2025:09:15:51 +0000] "GET /dolphinscheduler/actuator/prometheus HTTP/1.1" 200 73075 6ms
[WARN] 2025-06-17 17:16:00.777 +0800 o.a.z.ClientCnxn:[1292] - Session 0x301d5683db40000 for server dolphischeduler-zookeeper-1.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local/100.97.26.240:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x301d5683db40000, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
[INFO] 2025-06-17 17:16:00.878 +0800 o.a.c.f.s.ConnectionStateManager:[252] - State change: SUSPENDED
100.110.18.239 - - [17/Jun/2025:09:16:06 +0000] "GET /dolphinscheduler/actuator/prometheus HTTP/1.1" 200 73077 6ms
```

Finally, the ConnectionState changes to LOST:

```
[WARN] 2025-06-17 19:18:12.780 +0800 o.a.c.ConnectionState:[316] - Session expired event received
[INFO] 2025-06-17 19:18:12.781 +0800 o.a.c.f.s.ConnectionStateManager:[252] - State change: LOST
```

The worker logs are:

```
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 20020ms for session id 0x0
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1250)
[WI-0][TI-0] - [WARN] 2025-06-16 15:06:48.593 +0800 o.a.c.r.ExponentialBackoffRetry:[74] - Sleep extension too large (10000). Pinning to 3000
[WI-0][TI-0] - [ERROR] 2025-06-16 15:06:49.110 +0800 o.a.d.s.w.r.WorkerWaitingStrategy:[85] - Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server
org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleException: Waiting to reconnect to registry in PT1M40S failed
    at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:78)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:208)
    at com.sun.proxy.$Proxy111.disconnect(Unknown Source)
    at org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:53)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:47)
    at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281)
    at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
    at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
    at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
    at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281)
    at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
    at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.dolphinscheduler.registry.api.RegistryException: Cannot connect to registry in 100 s
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.connectUntilTimeout(ZookeeperRegistry.java:125)
    at org.apache.dolphinscheduler.registry.api.RegistryClient.connectUntilTimeout(RegistryClient.java:73)
    at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:75)
    ... 20 common frames omitted
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.120 +0800 o.a.d.s.w.WorkerServer:[130] - Worker server is stopping, current cause : Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server
[WI-0][TI-0] - [WARN] 2025-06-16 15:06:52.120 +0800 o.a.d.c.m.BaseHeartBeatTask:[84] - WorkerHeartBeatTask finished...
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.125 +0800 o.a.c.f.i.CuratorFrameworkImpl:[998] - backgroundOperationsLoop exiting
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.597 +0800 o.a.z.ClientCnxn:[1171] - Opening socket connection to server 100.96.39.186/100.96.39.186:2181.
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.597 +0800 o.a.z.ClientCnxn:[1173] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
[WI-0][TI-0] - [WARN] 2025-06-16 15:06:52.698 +0800 o.a.c.r.ExponentialBackoffRetry:[74] - Sleep extension too large (29000). Pinning to 3000
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.698 +0800 o.a.z.ClientCnxn:[568] - EventThread shut down for session: 0x0
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.698 +0800 o.a.z.ZooKeeper:[1232] - Session: 0x0 closed
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.699 +0800 o.a.d.s.w.r.WorkerRegistryClient:[136] - Worker registry client closed
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.701 +0800 o.a.d.e.b.s.NettyRemotingServer:[159] - netty server closed
[WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.701 +0800 o.a.d.s.w.WorkerServer:[143] - Worker server stopped, current cause: Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server
[WI-0][TI-0] - [ERROR] 2025-06-16 15:06:55.699 +0800 o.a.d.c.m.BaseHeartBeatTask:[71] - WorkerHeartBeatTask task execute failed
org.apache.dolphinscheduler.registry.api.RegistryException: Failed to put registry key: /nodes/worker/dolphinscheduler-worker-0.dolphinscheduler-worker-headless:1234
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.put(ZookeeperRegistry.java:182)
    at org.apache.dolphinscheduler.registry.api.RegistryClient.persistEphemeral(RegistryClient.java:174)
    at org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask.writeHeartBeat(WorkerHeartBeatTask.java:87)
    at org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask.writeHeartBeat(WorkerHeartBeatTask.java:37)
    at org.apache.dolphinscheduler.common.model.BaseHeartBeatTask.run(BaseHeartBeatTask.java:67)
Caused by: java.lang.IllegalStateException: Client is not started
    at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:507)
    at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:139)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:649)
    at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216)
    at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
    at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190)
    at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605)
    at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595)
    at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.put(ZookeeperRegistry.java:180)
    ... 4 common frames omitted
```

### What you expected to happen

Dolphinscheduler components should reconnect to the zookeeper pods automatically.

### How to reproduce

Run Dolphinscheduler on k8s. When a zookeeper pod restarts, the problem reproduces 100% of the time.

### Anything else

_No response_

### Version

3.2.x

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
