liaotian1005 opened a new issue, #13549:
URL: https://github.com/apache/dolphinscheduler/issues/13549

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
   The Master service fails to recover after a ZooKeeper outage.
   When the ZooKeeper service is shut down (`bin/zkServer.sh stop`), the master logs an error indicating that its connection to ZooKeeper has timed out:
   
   ```
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x1005793a4050000, likely server has closed socket
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1282)
   ```
   
   
   When ZooKeeper recovers, the master service stops because its fault recovery fails:
   ```
   [ERROR] 2023-02-11 16:23:15.011 +0800 org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy:[105] - Recover from waiting failed, the current server status is RUNNING, will stop the server
   org.apache.dolphinscheduler.remote.exceptions.RemoteException: NettyRemotingServer bind 5678 fail
        at org.apache.dolphinscheduler.remote.NettyRemotingServer.start(NettyRemotingServer.java:144)
        at org.apache.dolphinscheduler.server.master.rpc.MasterRPCServer.start(MasterRPCServer.java:108)
        at org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy.reStartMasterResource(MasterWaitingStrategy.java:130)
        at org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy.reconnect(MasterWaitingStrategy.java:97)
        at org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener.onUpdate(MasterConnectionStateListener.java:55)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:49)
   ```
   
   The MasterServer then shuts down because it did not recover correctly.
   
   ### What you expected to happen
   
   The master service should fail over cleanly once ZooKeeper recovers, without throwing a bind exception. I have fixed the bug so that this bind exception is no longer thrown.
   
   ### How to reproduce
   
   1. Start ZooKeeper and the MasterServer.
   2. Stop ZooKeeper with `bin/zkServer.sh stop`. The master logs the `ClientCnxn$EndOfStreamException` connection error shown above.
   3. Start ZooKeeper again with `bin/zkServer.sh start`.
   4. When the master reconnects, `MasterWaitingStrategy` fails to restart the RPC server with `NettyRemotingServer bind 5678 fail`, and the MasterServer shuts down, as in the second log above.
   
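The bind failure (`NettyRemotingServer bind 5678 fail`) raised from `reStartMasterResource` is consistent with the RPC server from the first start still holding port 5678 when the reconnect path calls `MasterRPCServer.start()` again. That is an assumption read off the stack trace, not a confirmed diagnosis, and the issue does not include the actual patch. The sketch below contrasts a naive restart with an idempotent one that releases the port first. `RestartableRpcServer`, `startNaively`, and `freePort` are hypothetical names, and `java.net.ServerSocket` stands in for Netty so the example runs with the JDK alone.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/**
 * Hypothetical stand-in for the master's RPC server (not DolphinScheduler code).
 * java.net.ServerSocket replaces NettyRemotingServer so the sketch needs only the JDK.
 */
public class RestartableRpcServer {

    private final int port;        // fixed listen port, e.g. 5678 in the issue
    private ServerSocket listener; // null while stopped

    public RestartableRpcServer(int port) {
        this.port = port;
    }

    /** Naive start: binds without releasing an earlier listener (the suspected failure mode). */
    public synchronized void startNaively() {
        listener = bind(port);
    }

    /** Idempotent start: closes any earlier listener before binding the same port again. */
    public synchronized void start() {
        stop();
        listener = bind(port);
    }

    /** Closes the listener if one is open; a no-op when already stopped. */
    public synchronized void stop() {
        if (listener == null) {
            return;
        }
        try {
            listener.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            listener = null;
        }
    }

    public synchronized boolean isRunning() {
        return listener != null && !listener.isClosed();
    }

    /** Binds a loopback listener; this ServerSocket constructor closes the socket itself if binding fails. */
    private static ServerSocket bind(int port) {
        try {
            return new ServerSocket(port, 50, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException("bind " + port + " fail", e);
        }
    }

    /** Finds a free port by binding port 0 and releasing it again (test helper only). */
    public static int freePort() {
        try (ServerSocket probe = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            return probe.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RestartableRpcServer server = new RestartableRpcServer(freePort());
        server.start();
        try {
            server.startNaively(); // simulated reconnect that keeps the old listener open
            System.out.println("naive restart: no error");
        } catch (UncheckedIOException e) {
            System.out.println("naive restart: " + e.getMessage());
        }
        server.start(); // simulated reconnect that releases the port first
        System.out.println("idempotent restart running: " + server.isRunning());
        server.stop();
    }
}
```

An alternative design is to skip the restart entirely when the RPC server is still running; either way, the reconnect path must not bind the same port twice.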
   
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.