hokie-chan opened a new issue, #13310: URL: https://github.com/apache/dolphinscheduler/issues/13310
### Search before asking - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues. ### What happened When worker registry (zookeeper) client recover from SUSPENDED to RECONNECTED, the worker itself will try to restart, but actually the worker is still normally running, then will cause "org.apache.dolphinscheduler.remote.exceptions.RemoteException: NettyRemotingServer bind 1234 fail", and finally cause the worker node hang. Please refer to "Anything else" section for logs. ### What you expected to happen When worker registry (zookeeper) client recover from SUSPENDED to RECONNECTED, the worker itself no need to restart, just do nothing, and of course, will not crash. ### How to reproduce Block the network connection from worker node to zookeeper server 2181 port, when we can see the "client timedout" log, the worker registry client will change to SUSPENDED status, then resume the network connection, then the worker registry client will reconnect to zookeeper server, and status will change to RECONNECTED. Now we can see the exception happen. ### Anything else ``` [INFO] 2022-12-26 11:06:35.672 +0800 org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask:[89] - [WorkflowInstance-0][TaskInstance-0] - Success write worker group heartBeatInfo into registry, workerRegistryPath: /nodes/worker/10.x.x.2:1234 workerHeartBeatInfo: {"startupTime":1671855451894,"reportTime":1672023995663,"cpuUsage":0.01,"memoryUsage":0.72,"loadAverage":0.0,"availablePhysicalMemorySize":2.14,"maxCpuloadAvg":4.0,"reservedMemory":0.03,"diskAvailable":28.47,"serverStatus":0,"processId":5576,"workerHostWeight":100,"workerWaitingTaskCount":0,"workerExecThreadCount":100} [WARN] 2022-12-26 11:06:55.680 +0800 org.apache.zookeeper.ClientCnxn:[1249] - [WorkflowInstance-0][TaskInstance-0] - Client session timed out, have not heard from server in 20008ms for session id 0x1028fe2bbbc003f [WARN] 2022-12-26 11:06:55.683 +0800 org.apache.zookeeper.ClientCnxn:[1292] - [WorkflowInstance-0][TaskInstance-0] - Session 0x1028fe2bbbc003f for server ds-1/10.x.x.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 20008ms for session id 0x1028fe2bbbc003f at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1250) [INFO] 2022-12-26 11:06:55.785 +0800 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: SUSPENDED [WARN] 2022-12-26 11:06:55.786 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[52] - [WorkflowInstance-0][TaskInstance-0] - Registry suspended [INFO] 2022-12-26 11:06:55.787 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[46] - [WorkflowInstance-0][TaskInstance-0] - Worker received a SUSPENDED event from registry, the current server state is RUNNING [INFO] 2022-12-26 11:06:57.539 +0800 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server ds-1/10.x.x.1:2181. [INFO] 2022-12-26 11:06:57.540 +0800 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error) [INFO] 2022-12-26 11:06:57.543 +0800 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /10.x.x.2:57590, server: ds-1/10.x.x.1:2181 [INFO] 2022-12-26 11:06:57.567 +0800 org.apache.zookeeper.ClientCnxn:[1444] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server ds-1/10.x.x.1:2181, session id = 0x1028fe2bbbc003f, negotiated timeout = 30000 [INFO] 2022-12-26 11:06:57.568 +0800 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED [INFO] 2022-12-26 11:06:57.568 +0800 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[48] - [WorkflowInstance-0][TaskInstance-0] - Registry reconnected [INFO] 2022-12-26 11:06:57.568 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[46] - [WorkflowInstance-0][TaskInstance-0] - Worker received a RECONNECTED event from registry, the current server state is RUNNING [WARN] 2022-12-26 11:06:57.569 +0800 org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleManager:[73] - [WorkflowInstance-0][TaskInstance-0] - The current server status is already running, cannot recover form waiting [INFO] 2022-12-26 11:06:57.569 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[75] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server starting [ERROR] 2022-12-26 11:06:57.573 +0800 org.apache.dolphinscheduler.remote.NettyRemotingServer:[143] - [WorkflowInstance-0][TaskInstance-0] - NettyRemotingServer bind fail bind(..) failed: Address already in use, exit io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use [ERROR] 2022-12-26 11:06:57.573 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[104] - [WorkflowInstance-0][TaskInstance-0] - Recover from waiting failed, the current server status is RUNNING, will stop the server org.apache.dolphinscheduler.remote.exceptions.RemoteException: NettyRemotingServer bind 1234 fail at org.apache.dolphinscheduler.remote.NettyRemotingServer.start(NettyRemotingServer.java:144) at org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer.start(WorkerRpcServer.java:93) at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reStartWorkerResource(WorkerWaitingStrategy.java:130) at org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reconnect(WorkerWaitingStrategy.java:97) at org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:54) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:49) at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281) at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43) at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) [INFO] 2022-12-26 11:06:57.577 +0800 org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask:[89] - [WorkflowInstance-0][TaskInstance-0] - Success write worker group heartBeatInfo into registry, workerRegistryPath: /nodes/worker/10.x.x.2:1234 workerHeartBeatInfo: {"startupTime":1671855451894,"reportTime":1672024005673,"cpuUsage":0.01,"memoryUsage":0.72,"loadAverage":0.0,"availablePhysicalMemorySize":2.14,"maxCpuloadAvg":4.0,"reservedMemory":0.03,"diskAvailable":28.47,"serverStatus":0,"processId":5576,"workerHostWeight":100,"workerWaitingTaskCount":0,"workerExecThreadCount":100} [INFO] 2022-12-26 11:07:00.573 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[148] - [WorkflowInstance-0][TaskInstance-0] - Worker server is stopping, current cause : Recover from waiting failed, the current server status is RUNNING, will stop the server [INFO] 2022-12-26 11:07:00.594 +0800 org.eclipse.jetty.server.AbstractConnector:[383] - [WorkflowInstance-0][TaskInstance-0] - Stopped ServerConnector@1682e6a{HTTP/1.1, (http/1.1)}{0.0.0.0:1235} [INFO] 2022-12-26 11:07:00.595 +0800 org.eclipse.jetty.server.session:[149] - [WorkflowInstance-0][TaskInstance-0] - node0 Stopped scavenging [INFO] 2022-12-26 11:07:00.596 +0800 org.eclipse.jetty.server.handler.ContextHandler.application:[2368] - [WorkflowInstance-0][TaskInstance-0] - Destroying Spring FrameworkServlet 'dispatcherServlet' [INFO] 2022-12-26 11:07:00.597 +0800 org.eclipse.jetty.server.handler.ContextHandler:[1159] - [WorkflowInstance-0][TaskInstance-0] - Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@6abb7b7d{application,/,[file:///tmp/jetty-docbase.1235.4335247786354365438/],STOPPED} [WARN] 2022-12-26 11:07:00.605 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[69] - [WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask task finished [INFO] 2022-12-26 11:07:00.607 +0800 org.apache.curator.framework.imps.CuratorFrameworkImpl:[998] - [WorkflowInstance-0][TaskInstance-0] - backgroundOperationsLoop exiting [INFO] 2022-12-26 11:07:00.721 +0800 org.apache.zookeeper.ZooKeeper:[1232] - [WorkflowInstance-0][TaskInstance-0] - Session: 0x1028fe2bbbc003f closed [INFO] 2022-12-26 11:07:00.721 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[111] - [WorkflowInstance-0][TaskInstance-0] - Worker registry client closed [INFO] 2022-12-26 11:07:00.722 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[99] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing [INFO] 2022-12-26 11:07:00.726 +0800 org.apache.dolphinscheduler.remote.NettyRemotingServer:[212] - [WorkflowInstance-0][TaskInstance-0] - netty server closed [INFO] 2022-12-26 11:07:00.726 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[101] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed [INFO] 2022-12-26 11:07:00.729 +0800 org.apache.zookeeper.ClientCnxn:[568] - [WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 0x1028fe2bbbc003f [INFO] 2022-12-26 11:07:00.730 +0800 org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - [WorkflowInstance-0][TaskInstance-0] - netty client closed [INFO] 2022-12-26 11:07:00.730 +0800 org.apache.dolphinscheduler.service.log.LogClient:[229] - [WorkflowInstance-0][TaskInstance-0] - LogClientService closed [INFO] 2022-12-26 11:07:00.730 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcClient:[71] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc client closing [INFO] 2022-12-26 11:07:00.733 +0800 org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - [WorkflowInstance-0][TaskInstance-0] - netty client closed [INFO] 2022-12-26 11:07:00.733 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcClient:[73] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc client closed [WARN] 2022-12-26 11:07:00.733 +0800 org.apache.dolphinscheduler.service.alert.AlertClientService:[76] - [WorkflowInstance-0][TaskInstance-0] - Alert client is already closed [INFO] 2022-12-26 11:07:00.735 +0800 com.zaxxer.hikari.HikariDataSource:[350] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown initiated... [INFO] 2022-12-26 11:07:00.745 +0800 com.zaxxer.hikari.HikariDataSource:[352] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown completed. [INFO] 2022-12-26 11:07:00.747 +0800 org.apache.dolphinscheduler.service.alert.AlertClientService:[80] - [WorkflowInstance-0][TaskInstance-0] - Alter client closing [INFO] 2022-12-26 11:07:00.748 +0800 org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - [WorkflowInstance-0][TaskInstance-0] - netty client closed [INFO] 2022-12-26 11:07:00.748 +0800 org.apache.dolphinscheduler.service.alert.AlertClientService:[82] - [WorkflowInstance-0][TaskInstance-0] - Alter client closed [WARN] 2022-12-26 11:07:00.748 +0800 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[69] - [WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask task finished [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[111] - [WorkflowInstance-0][TaskInstance-0] - Worker registry client closed [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[99] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[101] - [WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed [INFO] 2022-12-26 11:07:00.749 +0800 org.apache.dolphinscheduler.server.worker.WorkerServer:[155] - [WorkflowInstance-0][TaskInstance-0] - Worker server stopped, current cause: Recover from waiting failed, the current server status is RUNNING, will stop the server ``` ### Version 3.1.x ### Are you willing to submit PR? - [X] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
