q4q5q6qw opened a new issue, #15666: URL: https://github.com/apache/dolphinscheduler/issues/15666
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

After the system has run normally for a period of time, port 1234 on the worker node becomes unavailable, while the master stays normal. Following issue #12414, I checked the worker configuration items: registry-disconnect-strategy.strategy=waiting and registry-disconnect-strategy.max-waiting-time=100s. The worker log shows the following errors:

`[INFO] 2024-03-03 08:07:42.285 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.75.195.147:54462, server: 10.75.194.59/10.75.194.59:2181
[WARN] 2024-03-03 08:07:42.286 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.194.59/10.75.194.59:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
[INFO] 2024-03-03 08:07:42.856 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[52] - [WorkflowInstance-0][TaskInstance-0] - The current server status is WAITING, will not write heartBeatInfo into registry
[ERROR] 2024-03-03 08:07:43.475 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[80] - [WorkflowInstance-0][TaskInstance-0] - Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server
org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleException: Waiting to reconnect to registry in PT1M40S failed
    at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:208)
    at com.sun.proxy.$Proxy110.disconnect(Unknown Source)
    at org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:53)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:43)
    at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281)
    at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
    at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
    at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
    at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281)
    at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
    at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.dolphinscheduler.registry.api.RegistryException: Cannot connect to the Zookeeper registry in 100 s
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.connectUntilTimeout(ZookeeperRegistry.java:133)
    at org.apache.dolphinscheduler.registry.api.RegistryClient.connectUntilTimeout(RegistryClient.java:72)
    at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.disconnect(WorkerWaitingStrategy.java:71)
    ... 20 common frames omitted
[INFO] 2024-03-03 08:07:43.668 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 10.75.195.160/10.75.195.160:2181.
[INFO] 2024-03-03 08:07:43.668 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
[WARN] 2024-03-03 08:07:43.679 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.195.160/10.75.195.160:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
[INFO] 2024-03-03 08:07:44.683 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 10.75.195.174/10.75.195.174:2181.
[INFO] 2024-03-03 08:07:44.684 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
[INFO] 2024-03-03 08:07:44.684 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.75.195.147:39546, server: 10.75.195.174/10.75.195.174:2181
[WARN] 2024-03-03 08:07:44.685 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.195.174/10.75.195.174:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
[INFO] 2024-03-03 08:07:45.217 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 10.75.194.59/10.75.194.59:2181.
[INFO] 2024-03-03 08:07:45.217 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
[INFO] 2024-03-03 08:07:45.217 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.75.195.147:54470, server: 10.75.194.59/10.75.194.59:2181
[WARN] 2024-03-03 08:07:45.218 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x0 for server 10.75.194.59/10.75.194.59:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
[INFO] 2024-03-03 08:07:46.479 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[126] - [WorkflowInstance-0][TaskInstance-0] - Worker server is stopping, current cause : Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server
[INFO] 2024-03-03 08:07:46.479 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[146] - [WorkflowInstance-0][TaskInstance-0] - Worker begin to kill all cache task, task size: 2
[INFO] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[92] - [WorkflowInstance-18203][TaskInstance-18204] - Begin kill task instance, processId: 0
[ERROR] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[95] - [WorkflowInstance-18203][TaskInstance-18204] - Task instance kill failed, processId is not exist
[INFO] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[92] - [WorkflowInstance-18206][TaskInstance-18206] - Begin kill task instance, processId: 0
[ERROR] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils:[95] - [WorkflowInstance-18206][TaskInstance-18206] - Task instance kill failed, processId is not exist
[INFO] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[159] - [WorkflowInstance-0][TaskInstance-0] - Worker after kill all cache task, task size: 2, killed number: 0
[WARN] 2024-03-03 08:07:46.480 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[72] - [WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask finished...
[INFO] 2024-03-03 08:07:46.515 +0800 org.apache.curator.framework.imps.CuratorFrameworkImpl:[998] - [WorkflowInstance-0][TaskInstance-0] - backgroundOperationsLoop exiting
[WARN] 2024-03-03 08:07:47.007 +0800 org.apache.zookeeper.ClientCnxn:[1286] - [WorkflowInstance-0][TaskInstance-0] - An exception was thrown while closing send thread for session 0x0.
java.io.IOException: Connection has already been closed and reconnection is not allowed
    at org.apache.zookeeper.ClientCnxn$SendThread.changeZkState(ClientCnxn.java:990)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1141)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1200)
[INFO] 2024-03-03 08:07:47.116 +0800 org.apache.zookeeper.ClientCnxn:[568] - [WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 0x0
[INFO] 2024-03-03 08:07:47.116 +0800 org.apache.zookeeper.ZooKeeper:[1232] - [WorkflowInstance-0][TaskInstance-0] - Session: 0x0 closed
[INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[128] - [WorkflowInstance-0][TaskInstance-0] - Worker registry client closed
[INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[60] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing
[INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[62] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed
[INFO] 2024-03-03 08:07:47.119 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[133] - [WorkflowInstance-0][TaskInstance-0] - Worker server stopped, current cause: Disconnect from registry and change the current status to waiting error, the current server state is WAITING, will stop the current server`

### What you expected to happen

Scheduling continues normally.

### How to reproduce

See the log above.

### Anything else

_No response_

### Version

3.2.x

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
