Asakiny opened a new issue, #17265:
URL: https://github.com/apache/dolphinscheduler/issues/17265

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   I deployed version 3.2.2 on Kubernetes with the Helm chart, and 
DolphinScheduler connects to ZooKeeper through the Service address. When a 
ZooKeeper pod restarts and its pod IP changes, 
dolphinscheduler-api/master/worker cannot reconnect to ZooKeeper; we must 
restart the DolphinScheduler api/master/worker to reconnect.
   When ZooKeeper restarts, DolphinScheduler logs:
   
   ```
   [WARN] 2025-06-17 17:15:49.287 +0800 o.a.z.ClientCnxn:[1292] - Session 
0x301d5683db40000 for server 
dolphischeduler-zookeeper-2.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local/100.97.145.159:2181,
 Closing socket connection. Attempting reconnect except it is a 
SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read 
additional data from server sessionid 0x301d5683db40000, likely server has 
closed socket
           at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
           at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
           at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
   [INFO] 2025-06-17 17:15:49.388 +0800 o.a.c.ConnectionState:[284] - 
Connection string changed to: 
100.96.39.151:2181,100.97.26.240:2181,100.97.145.159:2181
   [INFO] 2025-06-17 17:15:49.388 +0800 o.a.c.f.s.ConnectionStateManager:[252] 
- State change: SUSPENDED
   [INFO] 2025-06-17 17:15:50.222 +0800 o.a.c.f.s.ConnectionStateManager:[252] 
- State change: RECONNECTED
   [INFO] 2025-06-17 17:15:50.224 +0800 o.a.c.f.i.EnsembleTracker:[201] - New 
config event received: 
{server.1=dolphischeduler-zookeeper-0.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local:2888:3888:participant;0.0.0.0:2181,
 version=0, 
server.3=dolphischeduler-zookeeper-2.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local:2888:3888:participant;0.0.0.0:2181,
 
server.2=dolphischeduler-zookeeper-1.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local:2888:3888:participant;0.0.0.0:2181}
   [ERROR] 2025-06-17 17:15:50.227 +0800 o.a.c.f.i.CuratorFrameworkImpl:[733] - 
Background exception was not retry-able or retry gave up
   java.lang.NullPointerException: null
           at 
org.apache.curator.utils.Compatibility.getHostAddress(Compatibility.java:116)
           at 
org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185)
           at 
org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206)
           at 
org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50)
           at 
org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150)
           at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:926)
           at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:683)
           at 
org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
           at 
org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222)
           at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:634)
           at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:553)
   100.110.18.239 - - [17/Jun/2025:09:15:51 +0000] "GET 
/dolphinscheduler/actuator/prometheus HTTP/1.1" 200 73075 6ms
   [WARN] 2025-06-17 17:16:00.777 +0800 o.a.z.ClientCnxn:[1292] - Session 
0x301d5683db40000 for server 
dolphischeduler-zookeeper-1.dolphischeduler-zookeeper-headless.dolphischeduler.svc.cluster.local/100.97.26.240:2181,
 Closing socket connection. Attempting reconnect except it is a 
SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read 
additional data from server sessionid 0x301d5683db40000, likely server has 
closed socket
           at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
           at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
           at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
   [INFO] 2025-06-17 17:16:00.878 +0800 o.a.c.f.s.ConnectionStateManager:[252] 
- State change: SUSPENDED
   100.110.18.239 - - [17/Jun/2025:09:16:06 +0000] "GET 
/dolphinscheduler/actuator/prometheus HTTP/1.1" 200 73077 6ms
   
   ```
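
In the log above, the config event lists each server by hostname with a wildcard client address (`0.0.0.0:2181`), while the connection string was switched to raw pod IPs. As a rough illustration of a hostname-based alternative, the sketch below derives `host:clientPort` pairs from such a config string. `ZkConfigHosts` and its method are hypothetical names, not Curator or DolphinScheduler code, the shortened hostnames in `main` are made up, and IPv6 literals are not handled.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; not part of Curator or DolphinScheduler. */
public final class ZkConfigHosts {

    private ZkConfigHosts() {}

    /**
     * Builds "host:clientPort,..." from entries such as
     * server.1=zk-0.zk-headless:2888:3888:participant;0.0.0.0:2181
     * keeping the quorum hostname instead of the wildcard client address.
     */
    public static String toConnectionString(String config) {
        // Sort by server id so the resulting order is stable.
        Map<Integer, String> byId = new TreeMap<>();
        for (String entry : config.split("[,\\n]")) {
            String line = entry.trim();
            int eq = line.indexOf('=');
            if (!line.startsWith("server.") || eq < 0) {
                continue; // skips "version=..." and blank entries
            }
            int id = Integer.parseInt(line.substring("server.".length(), eq));
            String value = line.substring(eq + 1);
            int semi = value.indexOf(';');
            int firstColon = value.indexOf(':');
            if (semi < 0 || firstColon < 0 || firstColon > semi) {
                continue; // no client port or no hostname to work with
            }
            String host = value.substring(0, firstColon);
            String clientPart = value.substring(semi + 1);
            // The client part is either "port" or "address:port".
            String port = clientPart.substring(clientPart.lastIndexOf(':') + 1);
            byId.put(id, host + ":" + port);
        }
        return String.join(",", byId.values());
    }

    public static void main(String[] args) {
        String config =
            "server.1=zk-0.zk-headless:2888:3888:participant;0.0.0.0:2181, "
          + "version=0, "
          + "server.3=zk-2.zk-headless:2888:3888:participant;0.0.0.0:2181, "
          + "server.2=zk-1.zk-headless:2888:3888:participant;0.0.0.0:2181";
        System.out.println(toConnectionString(config));
    }
}
```

A connection string built this way stays valid when pod IPs change, provided the client resolves the hostnames again on reconnect.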
   
   Eventually, the connection state changes to LOST:
   
   ```
   [WARN] 2025-06-17 19:18:12.780 +0800 o.a.c.ConnectionState:[316] - Session 
expired event received
   [INFO] 2025-06-17 19:18:12.781 +0800 o.a.c.f.s.ConnectionStateManager:[252] 
- State change: LOST
   
   ```
   
   The worker logs are:
   
   ```
   org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session 
timed out, have not heard from server in 20020ms for session id 0x0
           at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1250)
   [WI-0][TI-0] - [WARN] 2025-06-16 15:06:48.593 +0800 
o.a.c.r.ExponentialBackoffRetry:[74] - Sleep extension too large (10000). 
Pinning to 3000
   [WI-0][TI-0] - [ERROR] 2025-06-16 15:06:49.110 +0800 
o.a.d.s.w.r.WorkerWaitingStrategy:[85] - Disconnect from registry and change 
the current status to waiting error, the current server state is WAITING, will 
stop the current server
   org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleException: 
Waiting to reconnect to registry in PT1M40S failed
           at 
org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:78)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at 
org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
           at 
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:208)
           at com.sun.proxy.$Proxy111.disconnect(Unknown Source)
           at 
org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:53)
           at 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:47)
           at 
org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281)
           at 
org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
           at 
org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
           at 
org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
           at 
org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281)
           at 
org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
           at 
org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.dolphinscheduler.registry.api.RegistryException: 
Cannot connect to registry in 100 s
           at 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.connectUntilTimeout(ZookeeperRegistry.java:125)
           at 
org.apache.dolphinscheduler.registry.api.RegistryClient.connectUntilTimeout(RegistryClient.java:73)
           at 
org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:75)
           ... 20 common frames omitted
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.120 +0800 
o.a.d.s.w.WorkerServer:[130] - Worker server is stopping, current cause : 
Disconnect from registry and change the current status to waiting error, the 
current server state is WAITING, will stop the current server
   [WI-0][TI-0] - [WARN] 2025-06-16 15:06:52.120 +0800 
o.a.d.c.m.BaseHeartBeatTask:[84] - WorkerHeartBeatTask finished...
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.125 +0800 
o.a.c.f.i.CuratorFrameworkImpl:[998] - backgroundOperationsLoop exiting
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.597 +0800 o.a.z.ClientCnxn:[1171] 
- Opening socket connection to server 
100.96.39.186/100.96.39.186:2181.
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.597 +0800 o.a.z.ClientCnxn:[1173] 
- SASL config status: Will not attempt to authenticate using SASL (unknown 
error)
   [WI-0][TI-0] - [WARN] 2025-06-16 15:06:52.698 +0800 
o.a.c.r.ExponentialBackoffRetry:[74] - Sleep extension too large (29000). 
Pinning to 3000
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.698 +0800 o.a.z.ClientCnxn:[568] - 
EventThread shut down for session: 0x0
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.698 +0800 o.a.z.ZooKeeper:[1232] - 
Session: 0x0 closed
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.699 +0800 
o.a.d.s.w.r.WorkerRegistryClient:[136] - Worker registry client closed
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.701 +0800 
o.a.d.e.b.s.NettyRemotingServer:[159] - netty server closed
   [WI-0][TI-0] - [INFO] 2025-06-16 15:06:52.701 +0800 
o.a.d.s.w.WorkerServer:[143] - Worker server stopped, current cause: Disconnect 
from registry and change the current status to waiting error, the current 
server state is WAITING, will stop the current server
   [WI-0][TI-0] - [ERROR] 2025-06-16 15:06:55.699 +0800 
o.a.d.c.m.BaseHeartBeatTask:[71] - WorkerHeartBeatTask task execute failed
   org.apache.dolphinscheduler.registry.api.RegistryException: Failed to put 
registry key: 
/nodes/worker/dolphinscheduler-worker-0.dolphinscheduler-worker-headless:1234
           at 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.put(ZookeeperRegistry.java:182)
           at 
org.apache.dolphinscheduler.registry.api.RegistryClient.persistEphemeral(RegistryClient.java:174)
           at 
org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask.writeHeartBeat(WorkerHeartBeatTask.java:87)
           at 
org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask.writeHeartBeat(WorkerHeartBeatTask.java:37)
           at 
org.apache.dolphinscheduler.common.model.BaseHeartBeatTask.run(BaseHeartBeatTask.java:67)
   Caused by: java.lang.IllegalStateException: Client is not started
           at 
org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:507)
           at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:139)
           at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:649)
           at 
org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1216)
           at 
org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193)
           at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
           at 
org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190)
           at 
org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605)
           at 
org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595)
           at 
org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48)
           at 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.put(ZookeeperRegistry.java:180)
           ... 4 common frames omitted
   ```
   
   
   
   ### What you expected to happen
   
   DolphinScheduler components should reconnect to the ZooKeeper pods automatically.
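
One way to picture the expected behavior is a client that keeps the hostname and looks it up again on every connection attempt, rather than holding on to an IP resolved earlier. The sketch below is only an illustration using the JDK's `InetSocketAddress`; `ReResolvingEndpoint` is a hypothetical name, not how Curator, ZooKeeper, or DolphinScheduler implement reconnects, and fresh lookups remain subject to the JVM's DNS cache (`networkaddress.cache.ttl`).

```java
import java.net.InetSocketAddress;

/** Hypothetical helper illustrating lazy DNS resolution of a registry endpoint. */
public final class ReResolvingEndpoint {

    private final String host;
    private final int port;

    public ReResolvingEndpoint(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Keeps only the hostname and port; no DNS lookup is performed. */
    public InetSocketAddress unresolved() {
        return InetSocketAddress.createUnresolved(host, port);
    }

    /**
     * Performs a lookup at call time, so a restarted pod's new IP can be
     * picked up. If the lookup fails, the returned address is unresolved.
     */
    public InetSocketAddress resolveNow() {
        return new InetSocketAddress(host, port);
    }

    public static void main(String[] args) {
        // The hostname here is a made-up example.
        ReResolvingEndpoint endpoint = new ReResolvingEndpoint("zk-0.zk-headless", 2181);
        System.out.println(endpoint.unresolved());
        // Call resolveNow() before each reconnect attempt instead of caching the result.
        System.out.println(endpoint.resolveNow().isUnresolved());
    }
}
```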
   
   
   
   ### How to reproduce
   
   Run DolphinScheduler on Kubernetes and restart a ZooKeeper pod; the problem 
reproduces 100% of the time.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.2.x
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
