ruanwenjun opened a new issue #6045:
URL: https://github.com/apache/dolphinscheduler/issues/6045


   **Describe the bug**
   When a master/worker loses its ZooKeeper session, ZooKeeper removes its ephemeral node.
   Another node then performs the failover and creates a node for the lost server under the ZooKeeper dead-server path.
   The dead server's `HeartBeatTask` then finds that the server is listed under the dead-server path, and the server shuts itself down.
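The detection sequence can be modeled in a few lines. The sketch below is hypothetical and is not DolphinScheduler's `HeartBeatTask`. The `FailoverModel` class, its method names, and the use of in-memory sets in place of ZooKeeper nodes are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the failover flow described in this issue.
// Real DolphinScheduler keeps these nodes in ZooKeeper; here in-memory sets stand in for them.
class FailoverModel {
    // Ephemeral nodes of live servers (ZooKeeper removes them when the session is lost).
    final Set<String> ephemeralNodes = new HashSet<>();
    // Entries under the (assumed) dead-server path.
    final Set<String> deadServers = new HashSet<>();

    void register(String server) {
        ephemeralNodes.add(server);
    }

    // Session expiry: the ephemeral node disappears.
    void loseSession(String server) {
        ephemeralNodes.remove(server);
    }

    // Another node performs the failover and records the server as dead.
    void failover(String server) {
        deadServers.add(server);
    }

    // The heartbeat check on the dead server: stop if this server is listed as dead.
    boolean heartbeatShouldStop(String server) {
        return deadServers.contains(server);
    }
}
```

In this model `heartbeatShouldStop` only becomes true after `failover` has recorded the server, matching the order above: the ephemeral node disappears first, the dead-server entry appears second, and the heartbeat check reacts last.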
   
   When it stops, it unregisters itself from the registry (ZooKeeper).
   
https://github.com/apache/dolphinscheduler/blob/04720b327aef0649e9317573680874c20ea20ad5/dolphinscheduler-server/src/main/java/org/apache/dolphinscheduler/server/master/registry/MasterRegistryClient.java#L371-L379
   
   
https://github.com/apache/dolphinscheduler/blob/04720b327aef0649e9317573680874c20ea20ad5/dolphinscheduler-registry-plugin/dolphinscheduler-registry-zookeeper/src/main/java/org/apache/dolphinscheduler/plugin/registry/zookeeper/ZookeeperRegistry.java#L197-L204
   
   The problem is that the dead server's ephemeral node has already been removed, so the `client.delete()` call throws `KeeperException.NoNodeException`.
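The failure can be reproduced without a ZooKeeper cluster using a hypothetical in-memory registry that mimics ZooKeeper's delete semantics, where deleting a missing path is an error. `InMemoryRegistry`, its methods, and the example path are assumptions. `NoNodeException` here is a local stand-in for `KeeperException.NoNodeException`, so the sketch compiles without the ZooKeeper client.

```java
import java.util.HashSet;
import java.util.Set;

// Local stand-in for org.apache.zookeeper.KeeperException.NoNodeException,
// so this sketch runs without the ZooKeeper client on the classpath.
class NoNodeException extends Exception {
    NoNodeException(String path) {
        super("NoNode for " + path);
    }
}

// Hypothetical in-memory registry mimicking ZooKeeper's delete behavior:
// deleting a path that does not exist fails instead of silently succeeding.
class InMemoryRegistry {
    private final Set<String> nodes = new HashSet<>();

    void createEphemeral(String path) {
        nodes.add(path);
    }

    // Simulates ZooKeeper removing the ephemeral node after session loss.
    void expireSession(String path) {
        nodes.remove(path);
    }

    // Strict delete: throws if the node is already gone.
    void delete(String path) throws NoNodeException {
        if (!nodes.remove(path)) {
            throw new NoNodeException(path);
        }
    }

    boolean exists(String path) {
        return nodes.contains(path);
    }
}
```

Calling `createEphemeral`, then `expireSession`, then `delete` on the same path throws. This is the sequence the dead server goes through when it unregisters after its session is lost.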
   
   
   **To Reproduce**
   1. Start a master server
   2. Delete the master's ephemeral node to simulate a dead master.
   3. Observe the exception: the master cannot shut down, because some other threads cannot exit.
   
   
   **Which version of Dolphin Scheduler:**
    - [dev]
   
   **Requirement or improvement**
   Ignore `KeeperException.NoNodeException`: if the node does not exist, we can treat it as already deleted.
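A minimal sketch of the proposed fix, assuming the unregister path can wrap the strict delete so that a missing node counts as already deleted. `RegistryRemover` and its methods are hypothetical names. The nested `NoNodeException` is a local stand-in for ZooKeeper's exception, not the real class.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed fix: treat "node does not exist" as "already deleted".
// RegistryRemover is hypothetical; the nested NoNodeException stands in for
// ZooKeeper's KeeperException.NoNodeException.
class RegistryRemover {

    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    private final Set<String> nodes = new HashSet<>();

    void create(String path) {
        nodes.add(path);
    }

    // Simulates ZooKeeper dropping an ephemeral node after session expiry.
    void expire(String path) {
        nodes.remove(path);
    }

    // Strict delete, mirroring ZooKeeper: fails if the node is missing.
    private void delete(String path) throws NoNodeException {
        if (!nodes.remove(path)) {
            throw new NoNodeException(path);
        }
    }

    // Idempotent remove: the goal is "node absent", so a missing node is not an error.
    // Returns true if this call actually deleted the node, false if it was already gone.
    boolean remove(String path) {
        try {
            delete(path);
            return true;
        } catch (NoNodeException e) {
            // Already removed, e.g. by ZooKeeper after the session expired.
            return false;
        }
    }
}
```

Catching only the not-found exception, rather than every `KeeperException`, keeps other failures such as connection loss visible to the caller. Checking for existence before deleting would not be a safe substitute, because the node can vanish between the check and the delete.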
   

