hokie-chan opened a new issue, #13310:
URL: https://github.com/apache/dolphinscheduler/issues/13310

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When worker registry (zookeeper) client recover from SUSPENDED to 
RECONNECTED, the worker itself will try to restart, but actually the worker is 
still normally running, then will cause 
"org.apache.dolphinscheduler.remote.exceptions.RemoteException: 
NettyRemotingServer bind 1234 fail", and finally cause the worker node hang.
   Please refer to "Anything else" section for logs.
   
   ### What you expected to happen
   
   When worker registry (zookeeper) client recover from SUSPENDED to 
RECONNECTED, the worker itself no need to restart, just do nothing, and of 
course, will not crash.
   
   ### How to reproduce
   
   Block the network connection from worker node to zookeeper server 2181 port, 
when we can see the "client timedout" log, the worker registry client will 
change to SUSPENDED status, 
   then resume the network connection, then the worker registry client will 
reconnect to zookeeper server, and status will change to RECONNECTED. Now we 
can see the exception happen.
   
   ### Anything else
   
   ```  
   [INFO] 2022-12-26 11:06:35.672 +0800 
org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask:[89] - 
[WorkflowInstance-0][TaskInstance-0] - Success write worker group heartBeatInfo 
into registry, workerRegistryPath: /nodes/worker/10.x.x.2:1234 
workerHeartBeatInfo: 
{"startupTime":1671855451894,"reportTime":1672023995663,"cpuUsage":0.01,"memoryUsage":0.72,"loadAverage":0.0,"availablePhysicalMemorySize":2.14,"maxCpuloadAvg":4.0,"reservedMemory":0.03,"diskAvailable":28.47,"serverStatus":0,"processId":5576,"workerHostWeight":100,"workerWaitingTaskCount":0,"workerExecThreadCount":100}
   [WARN] 2022-12-26 11:06:55.680 +0800 org.apache.zookeeper.ClientCnxn:[1249] 
- [WorkflowInstance-0][TaskInstance-0] - Client session timed out, have not 
heard from server in 20008ms for session id 0x1028fe2bbbc003f
   [WARN] 2022-12-26 11:06:55.683 +0800 org.apache.zookeeper.ClientCnxn:[1292] 
- [WorkflowInstance-0][TaskInstance-0] - Session 0x1028fe2bbbc003f for server 
ds-1/10.x.x.1:2181, Closing socket connection. Attempting reconnect except it 
is a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session 
timed out, have not heard from server in 20008ms for session id 
0x1028fe2bbbc003f
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1250)
   [INFO] 2022-12-26 11:06:55.785 +0800 
org.apache.curator.framework.state.ConnectionStateManager:[252] - 
[WorkflowInstance-0][TaskInstance-0] - State change: SUSPENDED
   [WARN] 2022-12-26 11:06:55.786 +0800 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[52]
 - [WorkflowInstance-0][TaskInstance-0] - Registry suspended
   [INFO] 2022-12-26 11:06:55.787 +0800 
org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[46]
 - [WorkflowInstance-0][TaskInstance-0] - Worker received a SUSPENDED event 
from registry, the current server state is RUNNING
   [INFO] 2022-12-26 11:06:57.539 +0800 org.apache.zookeeper.ClientCnxn:[1171] 
- [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 
ds-1/10.x.x.1:2181.
   [INFO] 2022-12-26 11:06:57.540 +0800 org.apache.zookeeper.ClientCnxn:[1173] 
- [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt 
to authenticate using SASL (unknown error)
   [INFO] 2022-12-26 11:06:57.543 +0800 org.apache.zookeeper.ClientCnxn:[1005] 
- [WorkflowInstance-0][TaskInstance-0] - Socket connection established, 
initiating session, client: /10.x.x.2:57590, server: ds-1/10.x.x.1:2181
   [INFO] 2022-12-26 11:06:57.567 +0800 org.apache.zookeeper.ClientCnxn:[1444] 
- [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on 
server ds-1/10.x.x.1:2181, session id = 0x1028fe2bbbc003f, negotiated timeout = 
30000
   [INFO] 2022-12-26 11:06:57.568 +0800 
org.apache.curator.framework.state.ConnectionStateManager:[252] - 
[WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
   [INFO] 2022-12-26 11:06:57.568 +0800 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[48]
 - [WorkflowInstance-0][TaskInstance-0] - Registry reconnected
   [INFO] 2022-12-26 11:06:57.568 +0800 
org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener:[46]
 - [WorkflowInstance-0][TaskInstance-0] - Worker received a RECONNECTED event 
from registry, the current server state is RUNNING
   [WARN] 2022-12-26 11:06:57.569 +0800 
org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleManager:[73] - 
[WorkflowInstance-0][TaskInstance-0] - The current server status is already 
running, cannot recover form waiting
   [INFO] 2022-12-26 11:06:57.569 +0800 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[75] - 
[WorkflowInstance-0][TaskInstance-0] - Worker rpc server starting
   [ERROR] 2022-12-26 11:06:57.573 +0800 
org.apache.dolphinscheduler.remote.NettyRemotingServer:[143] - 
[WorkflowInstance-0][TaskInstance-0] - NettyRemotingServer bind fail bind(..) 
failed: Address already in use, exit
   io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address 
already in use
   [ERROR] 2022-12-26 11:06:57.573 +0800 
org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy:[104] 
- [WorkflowInstance-0][TaskInstance-0] - Recover from waiting failed, the 
current server status is RUNNING, will stop the server
   org.apache.dolphinscheduler.remote.exceptions.RemoteException: 
NettyRemotingServer bind 1234 fail
        at 
org.apache.dolphinscheduler.remote.NettyRemotingServer.start(NettyRemotingServer.java:144)
        at 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer.start(WorkerRpcServer.java:93)
        at 
org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reStartWorkerResource(WorkerWaitingStrategy.java:130)
        at 
org.apache.dolphinscheduler.server.worker.registry.WorkerWaitingStrategy.reconnect(WorkerWaitingStrategy.java:97)
        at 
org.apache.dolphinscheduler.server.worker.registry.WorkerConnectionStateListener.onUpdate(WorkerConnectionStateListener.java:54)
        at 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:49)
        at 
org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281)
        at 
org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92)
        at 
org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89)
        at 
org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89)
        at 
org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281)
        at 
org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
        at 
org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   [INFO] 2022-12-26 11:06:57.577 +0800 
org.apache.dolphinscheduler.server.worker.task.WorkerHeartBeatTask:[89] - 
[WorkflowInstance-0][TaskInstance-0] - Success write worker group heartBeatInfo 
into registry, workerRegistryPath: /nodes/worker/10.x.x.2:1234 
workerHeartBeatInfo: 
{"startupTime":1671855451894,"reportTime":1672024005673,"cpuUsage":0.01,"memoryUsage":0.72,"loadAverage":0.0,"availablePhysicalMemorySize":2.14,"maxCpuloadAvg":4.0,"reservedMemory":0.03,"diskAvailable":28.47,"serverStatus":0,"processId":5576,"workerHostWeight":100,"workerWaitingTaskCount":0,"workerExecThreadCount":100}
   [INFO] 2022-12-26 11:07:00.573 +0800 
org.apache.dolphinscheduler.server.worker.WorkerServer:[148] - 
[WorkflowInstance-0][TaskInstance-0] - Worker server is stopping, current cause 
: Recover from waiting failed, the current server status is RUNNING, will stop 
the server
   [INFO] 2022-12-26 11:07:00.594 +0800 
org.eclipse.jetty.server.AbstractConnector:[383] - 
[WorkflowInstance-0][TaskInstance-0] - Stopped 
ServerConnector@1682e6a{HTTP/1.1, (http/1.1)}{0.0.0.0:1235}
   [INFO] 2022-12-26 11:07:00.595 +0800 org.eclipse.jetty.server.session:[149] 
- [WorkflowInstance-0][TaskInstance-0] - node0 Stopped scavenging
   [INFO] 2022-12-26 11:07:00.596 +0800 
org.eclipse.jetty.server.handler.ContextHandler.application:[2368] - 
[WorkflowInstance-0][TaskInstance-0] - Destroying Spring FrameworkServlet 
'dispatcherServlet'
   [INFO] 2022-12-26 11:07:00.597 +0800 
org.eclipse.jetty.server.handler.ContextHandler:[1159] - 
[WorkflowInstance-0][TaskInstance-0] - Stopped 
o.s.b.w.e.j.JettyEmbeddedWebAppContext@6abb7b7d{application,/,[file:///tmp/jetty-docbase.1235.4335247786354365438/],STOPPED}
   [WARN] 2022-12-26 11:07:00.605 +0800 
org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[69] - 
[WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask task finished
   [INFO] 2022-12-26 11:07:00.607 +0800 
org.apache.curator.framework.imps.CuratorFrameworkImpl:[998] - 
[WorkflowInstance-0][TaskInstance-0] - backgroundOperationsLoop exiting
   [INFO] 2022-12-26 11:07:00.721 +0800 org.apache.zookeeper.ZooKeeper:[1232] - 
[WorkflowInstance-0][TaskInstance-0] - Session: 0x1028fe2bbbc003f closed
   [INFO] 2022-12-26 11:07:00.721 +0800 
org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[111] - 
[WorkflowInstance-0][TaskInstance-0] - Worker registry client closed
   [INFO] 2022-12-26 11:07:00.722 +0800 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[99] - 
[WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing
   [INFO] 2022-12-26 11:07:00.726 +0800 
org.apache.dolphinscheduler.remote.NettyRemotingServer:[212] - 
[WorkflowInstance-0][TaskInstance-0] - netty server closed
   [INFO] 2022-12-26 11:07:00.726 +0800 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[101] - 
[WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed
   [INFO] 2022-12-26 11:07:00.729 +0800 org.apache.zookeeper.ClientCnxn:[568] - 
[WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 
0x1028fe2bbbc003f
   [INFO] 2022-12-26 11:07:00.730 +0800 
org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - 
[WorkflowInstance-0][TaskInstance-0] - netty client closed
   [INFO] 2022-12-26 11:07:00.730 +0800 
org.apache.dolphinscheduler.service.log.LogClient:[229] - 
[WorkflowInstance-0][TaskInstance-0] - LogClientService closed
   [INFO] 2022-12-26 11:07:00.730 +0800 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcClient:[71] - 
[WorkflowInstance-0][TaskInstance-0] - Worker rpc client closing
   [INFO] 2022-12-26 11:07:00.733 +0800 
org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - 
[WorkflowInstance-0][TaskInstance-0] - netty client closed
   [INFO] 2022-12-26 11:07:00.733 +0800 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcClient:[73] - 
[WorkflowInstance-0][TaskInstance-0] - Worker rpc client closed
   [WARN] 2022-12-26 11:07:00.733 +0800 
org.apache.dolphinscheduler.service.alert.AlertClientService:[76] - 
[WorkflowInstance-0][TaskInstance-0] - Alert client is already closed
   [INFO] 2022-12-26 11:07:00.735 +0800 
com.zaxxer.hikari.HikariDataSource:[350] - [WorkflowInstance-0][TaskInstance-0] 
- DolphinScheduler - Shutdown initiated...
   [INFO] 2022-12-26 11:07:00.745 +0800 
com.zaxxer.hikari.HikariDataSource:[352] - [WorkflowInstance-0][TaskInstance-0] 
- DolphinScheduler - Shutdown completed.
   [INFO] 2022-12-26 11:07:00.747 +0800 
org.apache.dolphinscheduler.service.alert.AlertClientService:[80] - 
[WorkflowInstance-0][TaskInstance-0] - Alter client closing
   [INFO] 2022-12-26 11:07:00.748 +0800 
org.apache.dolphinscheduler.remote.NettyRemotingClient:[334] - 
[WorkflowInstance-0][TaskInstance-0] - netty client closed
   [INFO] 2022-12-26 11:07:00.748 +0800 
org.apache.dolphinscheduler.service.alert.AlertClientService:[82] - 
[WorkflowInstance-0][TaskInstance-0] - Alter client closed
   [WARN] 2022-12-26 11:07:00.748 +0800 
org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[69] - 
[WorkflowInstance-0][TaskInstance-0] - WorkerHeartBeatTask task finished
   [INFO] 2022-12-26 11:07:00.749 +0800 
org.apache.dolphinscheduler.server.worker.registry.WorkerRegistryClient:[111] - 
[WorkflowInstance-0][TaskInstance-0] - Worker registry client closed
   [INFO] 2022-12-26 11:07:00.749 +0800 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[99] - 
[WorkflowInstance-0][TaskInstance-0] - Worker rpc server closing
   [INFO] 2022-12-26 11:07:00.749 +0800 
org.apache.dolphinscheduler.server.worker.rpc.WorkerRpcServer:[101] - 
[WorkflowInstance-0][TaskInstance-0] - Worker rpc server closed
   [INFO] 2022-12-26 11:07:00.749 +0800 
org.apache.dolphinscheduler.server.worker.WorkerServer:[155] - 
[WorkflowInstance-0][TaskInstance-0] - Worker server stopped, current cause: 
Recover from waiting failed, the current server status is RUNNING, will stop 
the server
   ```
   
   ### Version
   
   3.1.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
[email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to