JaayYoung opened a new issue, #13315:
URL: https://github.com/apache/dolphinscheduler/issues/13315

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   Version 3.0.0. The services were running normally when one master node suddenly went down. Here is the log:
   [ERROR] 2022-12-31 05:20:45.000 +0800 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[324] - [WorkflowInstance-0][TaskInstance-0] - update master nodes error
   org.apache.dolphinscheduler.registry.api.RegistryException: zookeeper release lock error
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:215)
        at org.apache.dolphinscheduler.service.registry.RegistryClient.getLock(RegistryClient.java:231)
        at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.updateMasterNodes(ServerNodeManager.java:319)
        at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.access$800(ServerNodeManager.java:68)
        at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$MasterDataListener.notify(ServerNodeManager.java:303)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:128)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
        at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
        at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
        at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.IOException: Lost connection while trying to acquire lock: /lock/masters
        at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:91)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:204)
        ... 18 common frames omitted
   [ERROR] 2022-12-31 05:20:45.000 +0800 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[150] - [WorkflowInstance-0][TaskInstance-0] - MASTER server failover failed, host:192.168.142.20:5678
   org.apache.dolphinscheduler.registry.api.RegistryException: Failed to put registry key: /dead-servers/master_192.168.142.20:5678
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.put(ZookeeperRegistry.java:172)
        at org.apache.dolphinscheduler.service.registry.RegistryClient.lambda$handleDeadServer$1(RegistryClient.java:159)
        at java.util.Collections$SingletonSet.forEach(Collections.java:4767)
        at org.apache.dolphinscheduler.service.registry.RegistryClient.handleDeadServer(RegistryClient.java:150)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeMasterNodePath(MasterRegistryClient.java:142)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleMasterEvent(MasterRegistryDataListener.java:66)
        at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:52)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:128)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
        at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
        at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
        at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
        at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
        at org.apache.curator.framework.imps.CuratorFrameworkImpl.create(CuratorFrameworkImpl.java:445)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.put(ZookeeperRegistry.java:166)
        ... 20 common frames omitted
   [ERROR] 2022-12-31 05:20:45.000 +0800 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[307] - [WorkflowInstance-0][TaskInstance-0] - MasterNodeListener capture data change and get data failed.
   java.lang.NullPointerException: null
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.releaseLock(ZookeeperRegistry.java:222)
        at org.apache.dolphinscheduler.service.registry.RegistryClient.releaseLock(RegistryClient.java:235)
        at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.updateMasterNodes(ServerNodeManager.java:326)
        at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.access$800(ServerNodeManager.java:68)
        at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$MasterDataListener.notify(ServerNodeManager.java:303)
        at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:128)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:760)
        at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:754)
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
        at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:753)
        at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:75)
        at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:865)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
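
The first and third traces share a timestamp and a caller (`ServerNodeManager.updateMasterNodes`), which suggests that `releaseLock` ran after `acquireLock` had already failed, so there may have been no lock state left to release. The sketch below is not the DolphinScheduler or Curator code. It is a minimal stand-alone illustration, assuming a per-thread map of held locks, of how a release path can tolerate a failed acquire. All names (`LockRegistrySketch`, `runUnderLock`, `setConnected`) are hypothetical, and a `ReentrantLock` stands in for the ZooKeeper mutex.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a registry that tracks the locks held by each thread.
class LockRegistrySketch {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    // Stays unset (null) for a thread until that thread acquires its first lock.
    private final ThreadLocal<Map<String, ReentrantLock>> held = new ThreadLocal<>();
    private volatile boolean connected = true;

    // Simulates the registry connection being lost or restored.
    void setConnected(boolean value) {
        connected = value;
    }

    // Throws before recording anything if the (simulated) connection is down.
    boolean acquireLock(String key) {
        if (!connected) {
            throw new IllegalStateException("Lost connection while trying to acquire lock: " + key);
        }
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        Map<String, ReentrantLock> map = held.get();
        if (map == null) {
            map = new HashMap<>();
            held.set(map);
        }
        map.put(key, lock);
        return true;
    }

    // Null-safe release: returns false instead of throwing when nothing is held.
    boolean releaseLock(String key) {
        Map<String, ReentrantLock> map = held.get();
        if (map == null) {
            return false;
        }
        ReentrantLock lock = map.remove(key);
        if (lock == null) {
            return false;
        }
        lock.unlock();
        if (map.isEmpty()) {
            held.remove();
        }
        return true;
    }

    // Releases only when the acquire succeeded; reports false if it did not.
    boolean runUnderLock(String key, Runnable action) {
        try {
            acquireLock(key);
        } catch (IllegalStateException e) {
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            releaseLock(key);
        }
    }
}
```

With this shape, calling `releaseLock("/lock/masters")` on a thread that never acquired the lock returns `false` rather than raising a `NullPointerException`, and `runUnderLock` skips the release entirely when the acquire fails.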
   
   ### What you expected to happen
   
   The master and worker nodes should keep running normally.
   
   ### How to reproduce
   
   Please refer to the log above.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.0.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

