liaotian1005 opened a new issue, #13549: URL: https://github.com/apache/dolphinscheduler/issues/13549
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

The master service fails to recover after ZooKeeper comes back.

When the ZooKeeper service is shut down (`bin/zkServer.sh stop`), the master throws an exception indicating that its connection to ZooKeeper has timed out:

```
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x1005793a4050000, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
```

When ZooKeeper recovers, the master service stops because fault recovery fails:

```
[ERROR] 2023-02-11 16:23:15.011 +0800 org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy:[105] - Recover from waiting failed, the current server status is RUNNING, will stop the server
org.apache.dolphinscheduler.remote.exceptions.RemoteException: NettyRemotingServer bind 5678 fail
    at org.apache.dolphinscheduler.remote.NettyRemotingServer.start(NettyRemotingServer.java:144)
    at org.apache.dolphinscheduler.server.master.rpc.MasterRPCServer.start(MasterRPCServer.java:108)
    at org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy.reStartMasterResource(MasterWaitingStrategy.java:130)
    at org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy.reconnect(MasterWaitingStrategy.java:97)
    at org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener.onUpdate(MasterConnectionStateListener.java:55)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:49)
MasterServer shutdown ,due to that did not recover correctly
```

### What you expected to happen

I have fixed the bug so that no bind exception is thrown when the master service fails over.

### How to reproduce

The master service fails to recover after ZooKeeper comes back.

When the ZooKeeper service is shut down (`bin/zkServer.sh stop`), the master throws an exception indicating that its connection to ZooKeeper has timed out:

```
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x1005793a4050000, likely server has closed socket
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
```

When ZooKeeper recovers, the master service stops because fault recovery fails:

```
[ERROR] 2023-02-11 16:23:15.011 +0800 org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy:[105] - Recover from waiting failed, the current server status is RUNNING, will stop the server
org.apache.dolphinscheduler.remote.exceptions.RemoteException: NettyRemotingServer bind 5678 fail
    at org.apache.dolphinscheduler.remote.NettyRemotingServer.start(NettyRemotingServer.java:144)
    at org.apache.dolphinscheduler.server.master.rpc.MasterRPCServer.start(MasterRPCServer.java:108)
    at org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy.reStartMasterResource(MasterWaitingStrategy.java:130)
    at org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy.reconnect(MasterWaitingStrategy.java:97)
    at org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener.onUpdate(MasterConnectionStateListener.java:55)
    at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:49)
```

### Anything else

_No response_

### Version

dev

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!
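
### Illustration of the bind failure

The `NettyRemotingServer bind 5678 fail` error in the log above is consistent with a second listener being started on a port that an earlier listener in the same process still holds. That reading is an assumption drawn from the "current server status is RUNNING" message, not a confirmed root cause. The sketch below reproduces the general effect with a plain `java.net.ServerSocket` instead of Netty. `RebindSketch` and its methods are hypothetical names and are not part of DolphinScheduler.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class RebindSketch {

    /** Tries to listen on the loopback address; returns null if the port is already in use. */
    static ServerSocket tryBind(int port) throws IOException {
        ServerSocket socket = new ServerSocket();
        try {
            socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return socket;
        } catch (BindException e) {
            socket.close();
            return null;
        }
    }

    /** Starts a second listener while the first one is still open. */
    static boolean bindWhileHeld() throws IOException {
        try (ServerSocket first = tryBind(0)) { // port 0 lets the OS pick a free port
            ServerSocket second = tryBind(first.getLocalPort());
            if (second != null) {
                second.close();
                return true;
            }
            return false;
        }
    }

    /** Closes the first listener before starting the second one on the same port. */
    static boolean bindAfterRelease() throws IOException {
        ServerSocket first = tryBind(0);
        int port = first.getLocalPort();
        first.close(); // release the port before restarting
        ServerSocket second = tryBind(port);
        if (second != null) {
            second.close();
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bind while old listener is open: " + bindWhileHeld());
        System.out.println("bind after old listener is closed: " + bindAfterRelease());
    }
}
```

If the assumption holds, the recovery path would need to either close the existing RPC server before calling `start` again, or skip the restart when the server is still running.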
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
