jamescheng16 opened a new issue, #12414: URL: https://github.com/apache/dolphinscheduler/issues/12414
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

First I received a worker crash alert, and then the master crashed as well. Here is the master log:

`[INFO] 2022-10-18 05:07:40.516 +0000 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[129] - [WorkflowInstance-23201][TaskInstance-0] - Success remove workflow instance from timeout check list [INFO] 2022-10-18 05:07:40.516 +0000 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[136] - [WorkflowInstance-23201][TaskInstance-0] - Workflow instance is finished. [INFO] 2022-10-18 05:07:41.565 +0000 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server dolphinscheduler/172.16.18.125:2181. [INFO] 2022-10-18 05:07:41.565 +0000 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error) [INFO] 2022-10-18 05:07:41.566 +0000 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /172.16.18.125:50356, server: dolphinscheduler/172.16.18.125:2181 [INFO] 2022-10-18 05:07:41.568 +0000 org.apache.zookeeper.ClientCnxn:[1444] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server dolphinscheduler/172.16.18.125:2181, session id = 0x10001031c9e005d, negotiated timeout = 40000 [INFO] 2022-10-18 05:07:41.568 +0000 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED [INFO] 2022-10-18 05:07:41.568 +0000 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[48] - [WorkflowInstance-0][TaskInstance-0] - Registry reconnected [INFO] 2022-10-18 05:07:41.568 +0000 
org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener:[47] - [WorkflowInstance-0][TaskInstance-0] - Master received a RECONNECTED event from registry, the current server state is RUNNING [ERROR] 2022-10-18 05:07:41.568 +0000 org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy:[106] - [WorkflowInstance-0][TaskInstance-0] - Recover from waiting failed, the current server status is RUNNING, will stop the server org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleException: The current server status is not waiting, cannot recover form waiting at org.apache.dolphinscheduler.common.lifecycle.ServerLifeCycleManager.recoverFromWaiting(ServerLifeCycleManager.java:68) at org.apache.dolphinscheduler.server.master.registry.MasterWaitingStrategy.reconnect(MasterWaitingStrategy.java:97) at org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener.onUpdate(MasterConnectionStateListener.java:55) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener.stateChanged(ZookeeperConnectionStateListener.java:49) at org.apache.curator.framework.state.ConnectionStateManager.lambda$processEvents$0(ConnectionStateManager.java:281) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:281) at org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43) at org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:134) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) [INFO] 2022-10-18 05:07:41.569 +0000 org.apache.curator.framework.imps.EnsembleTracker:[201] - [WorkflowInstance-0][TaskInstance-0] - New config event received: {} [INFO] 2022-10-18 05:07:41.574 +0000 org.apache.dolphinscheduler.server.master.task.MasterHeartBeatTask:[70] - [WorkflowInstance-0][TaskInstance-0] - Success write master heartBeatInfo into registry, masterRegistryPath: /nodes/master/172.16.18.125:5678, heartBeatInfo: {"startupTime":1666066687942,"reportTime":1666069617347,"cpuUsage":0.0,"memoryUsage":0.33,"loadAverage":0.0,"availablePhysicalMemorySize":10.51,"maxCpuloadAvg":8.0,"reservedMemory":0.3,"diskAvailable":421.28,"processId":640966} [INFO] 2022-10-18 05:07:42.554 +0000 org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[81] - [WorkflowInstance-0][TaskInstance-0] - worker node deleted : /nodes/worker/default/172.16.18.127:1234 [INFO] 2022-10-18 05:07:42.554 +0000 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[127] - [WorkflowInstance-0][TaskInstance-0] - WORKER node deleted : /nodes/worker/default/172.16.18.127:1234 [INFO] 2022-10-18 05:07:42.558 +0000 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[137] - [WorkflowInstance-0][TaskInstance-0] - path: /nodes/worker/default/172.16.18.127:1234 not exists [INFO] 2022-10-18 05:07:42.558 +0000 org.apache.dolphinscheduler.server.master.service.FailoverService:[58] - [WorkflowInstance-0][TaskInstance-0] - Worker failover staring, workerServer: 172.16.18.127:1234 [INFO] 2022-10-18 05:07:42.559 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[97] - [WorkflowInstance-0][TaskInstance-0] - Worker[172.16.18.127:1234] failover starting [INFO] 2022-10-18 05:07:42.559 +0000 
org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[272] - [WorkflowInstance-0][TaskInstance-0] - worker group node : /nodes/worker/default/172.16.18.127:1234 down. [INFO] 2022-10-18 05:07:42.565 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[109] - [WorkflowInstance-0][TaskInstance-0] - Worker[172.16.18.127:1234] failover there are 4 taskInstance may need to failover, will do a deep check, taskInstanceIds: [24542, 24541, 24540, 24537] [INFO] 2022-10-18 05:07:42.567 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[131] - [WorkflowInstance-23200][TaskInstance-24542] - Worker[172.16.18.127:1234] failover: begin to failover taskInstance, will set the status to NEED_FAULT_TOLERANCE [INFO] 2022-10-18 05:07:42.567 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[165] - [WorkflowInstance-23200][TaskInstance-24542] - The failover taskInstance is not master task [INFO] 2022-10-18 05:07:42.567 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[174] - [WorkflowInstance-23200][TaskInstance-24542] - TaskInstance failover begin kill the task related yarn job [INFO] 2022-10-18 05:07:42.568 +0000 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-23200][TaskInstance-24542] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=23200, taskInstanceId=24542, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null) [INFO] 2022-10-18 05:07:42.569 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[135] - [WorkflowInstance-23200][TaskInstance-24542] - Worker[172.16.18.127:1234] failover: Finish failover taskInstance [INFO] 2022-10-18 05:07:42.569 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[131] - [WorkflowInstance-23202][TaskInstance-24541] - 
Worker[172.16.18.127:1234] failover: begin to failover taskInstance, will set the status to NEED_FAULT_TOLERANCE [INFO] 2022-10-18 05:07:42.569 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[165] - [WorkflowInstance-23202][TaskInstance-24541] - The failover taskInstance is not master task [INFO] 2022-10-18 05:07:42.569 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[174] - [WorkflowInstance-23202][TaskInstance-24541] - TaskInstance failover begin kill the task related yarn job [INFO] 2022-10-18 05:07:42.570 +0000 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-23202][TaskInstance-24541] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=23202, taskInstanceId=24541, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null) [INFO] 2022-10-18 05:07:42.570 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[135] - [WorkflowInstance-23202][TaskInstance-24541] - Worker[172.16.18.127:1234] failover: Finish failover taskInstance [INFO] 2022-10-18 05:07:42.570 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[131] - [WorkflowInstance-23204][TaskInstance-24540] - Worker[172.16.18.127:1234] failover: begin to failover taskInstance, will set the status to NEED_FAULT_TOLERANCE [INFO] 2022-10-18 05:07:42.570 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[165] - [WorkflowInstance-23204][TaskInstance-24540] - The failover taskInstance is not master task [INFO] 2022-10-18 05:07:42.570 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[174] - [WorkflowInstance-23204][TaskInstance-24540] - TaskInstance failover begin kill the task related yarn job [INFO] 2022-10-18 05:07:42.571 +0000 
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-23204][TaskInstance-24540] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=23204, taskInstanceId=24540, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null) [INFO] 2022-10-18 05:07:42.571 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[135] - [WorkflowInstance-23204][TaskInstance-24540] - Worker[172.16.18.127:1234] failover: Finish failover taskInstance [INFO] 2022-10-18 05:07:42.571 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[131] - [WorkflowInstance-23199][TaskInstance-24537] - Worker[172.16.18.127:1234] failover: begin to failover taskInstance, will set the status to NEED_FAULT_TOLERANCE [INFO] 2022-10-18 05:07:42.571 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[165] - [WorkflowInstance-23199][TaskInstance-24537] - The failover taskInstance is not master task [INFO] 2022-10-18 05:07:42.571 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[174] - [WorkflowInstance-23199][TaskInstance-24537] - TaskInstance failover begin kill the task related yarn job [INFO] 2022-10-18 05:07:43.573 +0000 org.apache.dolphinscheduler.service.log.LogClientService:[206] - [WorkflowInstance-23199][TaskInstance-24537] - Begin to get appIds from worker: 172.16.18.127:1234 taskLogPath: /tmp/dolphinscheduler/worker-server/logs/20221018/7185584338369_7-23199-24537.log [WARN] 2022-10-18 05:07:43.583 +0000 org.apache.dolphinscheduler.remote.NettyRemotingClient:[369] - [WorkflowInstance-23199][TaskInstance-24537] - connect to Host{address='172.16.18.127:1234', ip='172.16.18.127', port=1234} error io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) 
failed: Connection refused: /172.16.18.127:1234 Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124) at io.netty.channel.unix.Socket.finishConnect(Socket.java:251) at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673) at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650) at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530) at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at java.lang.Thread.run(Thread.java:750) [INFO] 2022-10-18 05:07:43.588 +0000 org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - [WorkflowInstance-23199][TaskInstance-24537] - netty client closed [INFO] 2022-10-18 05:07:43.588 +0000 org.apache.dolphinscheduler.service.log.LogClientService:[81] - [WorkflowInstance-23199][TaskInstance-24537] - logger client closed [ERROR] 2022-10-18 05:07:43.588 +0000 org.apache.dolphinscheduler.server.utils.ProcessUtils:[216] - [WorkflowInstance-23199][TaskInstance-24537] - Kill yarn job failure, taskInstanceId: 24537 org.apache.dolphinscheduler.remote.exceptions.RemotingException: connect to : Host{address='172.16.18.127:1234', ip='172.16.18.127', port=1234} fail at org.apache.dolphinscheduler.remote.NettyRemotingClient.sendSync(NettyRemotingClient.java:258) at org.apache.dolphinscheduler.service.log.LogClientService.getAppIds(LogClientService.java:214) at org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:198) at 
org.apache.dolphinscheduler.server.master.service.WorkerFailoverService.failoverTaskInstance(WorkerFailoverService.java:175) at org.apache.dolphinscheduler.server.master.service.WorkerFailoverService.failoverWorker(WorkerFailoverService.java:134) at org.apache.dolphinscheduler.server.master.service.FailoverService.failoverServerWhenDown(FailoverService.java:59) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeWorkerNodePath(MasterRegistryClient.java:142) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleWorkerEvent(MasterRegistryDataListener.java:82) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:55) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:142) at org.apache.curator.framework.recipes.cache.TreeCache.lambda$callListeners$1(TreeCache.java:811) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:807) at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:79) at org.apache.curator.framework.recipes.cache.TreeCache$2.run(TreeCache.java:909) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) [INFO] 2022-10-18 05:07:43.590 +0000 
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-23199][TaskInstance-24537] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=23199, taskInstanceId=24537, taskCode=0, status=TaskExecutionStatus{code=8, desc='need fault tolerance'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null) [INFO] 2022-10-18 05:07:43.590 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[135] - [WorkflowInstance-23199][TaskInstance-24537] - Worker[172.16.18.127:1234] failover: Finish failover taskInstance [INFO] 2022-10-18 05:07:43.590 +0000 org.apache.dolphinscheduler.server.master.service.WorkerFailoverService:[143] - [WorkflowInstance-0][TaskInstance-0] - Worker[172.16.18.127:1234] failover finished, useTime:1031ms [INFO] 2022-10-18 05:07:43.590 +0000 org.apache.dolphinscheduler.server.master.service.FailoverService:[60] - [WorkflowInstance-0][TaskInstance-0] - Worker failover finished, workerServer: 172.16.18.127:1234 [INFO] 2022-10-18 05:07:44.569 +0000 org.apache.dolphinscheduler.server.master.MasterServer:[135] - [WorkflowInstance-0][TaskInstance-0] - Master server is stopping, current cause : Recover from waiting failed, the current server status is RUNNING, will stop the server [INFO] 2022-10-18 05:07:44.573 +0000 org.quartz.core.QuartzScheduler:[585] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_geneseeq1666066688472 paused. 
[INFO] 2022-10-18 05:07:44.581 +0000 org.eclipse.jetty.server.AbstractConnector:[383] - [WorkflowInstance-0][TaskInstance-0] - Stopped ServerConnector@e07b4db{HTTP/1.1, (http/1.1)}{0.0.0.0:5679} [INFO] 2022-10-18 05:07:44.581 +0000 org.eclipse.jetty.server.session:[149] - [WorkflowInstance-0][TaskInstance-0] - node0 Stopped scavenging [INFO] 2022-10-18 05:07:44.583 +0000 org.eclipse.jetty.server.handler.ContextHandler.application:[2368] - [WorkflowInstance-0][TaskInstance-0] - Destroying Spring FrameworkServlet 'dispatcherServlet' [INFO] 2022-10-18 05:07:44.583 +0000 org.eclipse.jetty.server.handler.ContextHandler:[1159] - [WorkflowInstance-0][TaskInstance-0] - Stopped o.s.b.w.e.j.JettyEmbeddedWebAppContext@37f71c05{application,/,[file:///tmp/jetty-docbase.5679.2442912892556208592/],STOPPED} [INFO] 2022-10-18 05:07:44.588 +0000 org.quartz.core.QuartzScheduler:[666] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_geneseeq1666066688472 shutting down. [INFO] 2022-10-18 05:07:44.588 +0000 org.quartz.core.QuartzScheduler:[585] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_geneseeq1666066688472 paused. [INFO] 2022-10-18 05:07:44.590 +0000 org.quartz.core.QuartzScheduler:[740] - [WorkflowInstance-0][TaskInstance-0] - Scheduler DolphinScheduler_$_geneseeq1666066688472 shutdown complete. [INFO] 2022-10-18 05:07:44.590 +0000 org.springframework.scheduling.quartz.SchedulerFactoryBean:[847] - [WorkflowInstance-0][TaskInstance-0] - Shutting down Quartz Scheduler [INFO] 2022-10-18 05:07:44.591 +0000 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[126] - [WorkflowInstance-0][TaskInstance-0] - Master schedule bootstrap stopping... [INFO] 2022-10-18 05:07:44.591 +0000 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[127] - [WorkflowInstance-0][TaskInstance-0] - Master schedule bootstrap stopped... 
[INFO] 2022-10-18 05:07:44.592 +0000 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[93] - [WorkflowInstance-0][TaskInstance-0] - MASTER node deleted : /nodes/master/172.16.18.125:5678 [INFO] 2022-10-18 05:07:44.592 +0000 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[301] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/172.16.18.125:5678 down. [INFO] 2022-10-18 05:07:44.592 +0000 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[108] - [WorkflowInstance-0][TaskInstance-0] - path: /nodes/master/172.16.18.125:5678 not exists [INFO] 2022-10-18 05:07:44.593 +0000 org.apache.dolphinscheduler.server.master.service.FailoverService:[53] - [WorkflowInstance-0][TaskInstance-0] - Master failover starting, masterServer: 172.16.18.125:5678 [INFO] 2022-10-18 05:07:44.592 +0000 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[176] - [WorkflowInstance-0][TaskInstance-0] - Master node : 172.16.18.125:5678 unRegistry to register center. 
[WARN] 2022-10-18 05:07:44.593 +0000 org.apache.dolphinscheduler.common.model.BaseHeartBeatTask:[69] - [WorkflowInstance-0][TaskInstance-0] - MasterHeartBeatTask task finished [INFO] 2022-10-18 05:07:44.600 +0000 org.apache.curator.framework.imps.CuratorFrameworkImpl:[998] - [WorkflowInstance-0][TaskInstance-0] - backgroundOperationsLoop exiting [WARN] 2022-10-18 05:07:44.601 +0000 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[346] - [WorkflowInstance-0][TaskInstance-0] - current addr:172.16.18.125:5678 is not in active master list [INFO] 2022-10-18 05:07:44.601 +0000 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[349] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, slot: 0, addr: 172.16.18.125:5678 [ERROR] 2022-10-18 05:07:44.604 +0000 org.apache.curator.framework.imps.CuratorFrameworkImpl:[733] - [WorkflowInstance-0][TaskInstance-0] - Background exception was not retry-able or retry gave up java.lang.IllegalStateException: Client is not started at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:507) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:139) at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:649) at org.apache.curator.framework.imps.WatcherRemovalFacade.getZooKeeper(WatcherRemovalFacade.java:146) at org.apache.curator.framework.imps.FindAndDeleteProtectedNodeInBackground.performBackgroundOperation(FindAndDeleteProtectedNodeInBackground.java:108) at org.apache.curator.framework.imps.OperationAndData.callPerformBackgroundOperation(OperationAndData.java:84) at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:1008) at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:667) at 
org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) at org.apache.curator.framework.imps.FindAndDeleteProtectedNodeInBackground.execute(FindAndDeleteProtectedNodeInBackground.java:60) at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:617) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225) at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237) at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:218) at org.apache.dolphinscheduler.service.registry.RegistryClient.getLock(RegistryClient.java:217) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService.failoverMaster(MasterFailoverService.java:112) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService$$FastClassBySpringCGLIB$$479c980c.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at 
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService$$EnhancerBySpringCGLIB$$f5fc50f2.failoverMaster(<generated>) at org.apache.dolphinscheduler.server.master.service.FailoverService.failoverServerWhenDown(FailoverService.java:54) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeMasterNodePath(MasterRegistryClient.java:112) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleMasterEvent(MasterRegistryDataListener.java:66) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:52) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:142) at org.apache.curator.framework.recipes.cache.TreeCache.lambda$callListeners$1(TreeCache.java:811) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:807) at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:79) at org.apache.curator.framework.recipes.cache.TreeCache$2.run(TreeCache.java:909) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) [ERROR] 2022-10-18 05:07:44.605 +0000 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[306] - [WorkflowInstance-0][TaskInstance-0] - MasterNodeListener capture data change and get data failed. org.apache.dolphinscheduler.registry.api.RegistryException: zookeeper release lock error at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.releaseLock(ZookeeperRegistry.java:246) at org.apache.dolphinscheduler.service.registry.RegistryClient.releaseLock(RegistryClient.java:221) at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.updateMasterNodes(ServerNodeManager.java:325) at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager.access$900(ServerNodeManager.java:71) at org.apache.dolphinscheduler.server.master.registry.ServerNodeManager$MasterDataListener.notify(ServerNodeManager.java:302) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:142) at org.apache.curator.framework.recipes.cache.TreeCache.lambda$callListeners$1(TreeCache.java:811) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:807) at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:79) at org.apache.curator.framework.recipes.cache.TreeCache$2.run(TreeCache.java:909) at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED] at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823) at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:457) at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:477) at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:347) at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:124) at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.releaseLock(ZookeeperRegistry.java:240) ... 
17 common frames omitted [ERROR] 2022-10-18 05:07:44.606 +0000 org.apache.dolphinscheduler.server.master.service.MasterFailoverService:[115] - [WorkflowInstance-0][TaskInstance-0] - Master server failover failed, host:172.16.18.125:5678 org.apache.dolphinscheduler.registry.api.RegistryException: zookeeper release lock error at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:229) at org.apache.dolphinscheduler.service.registry.RegistryClient.getLock(RegistryClient.java:217) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService.failoverMaster(MasterFailoverService.java:112) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService$$FastClassBySpringCGLIB$$479c980c.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService$$EnhancerBySpringCGLIB$$f5fc50f2.failoverMaster(<generated>) at org.apache.dolphinscheduler.server.master.service.FailoverService.failoverServerWhenDown(FailoverService.java:54) at 
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeMasterNodePath(MasterRegistryClient.java:112) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleMasterEvent(MasterRegistryDataListener.java:66) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:52) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:142) at org.apache.curator.framework.recipes.cache.TreeCache.lambda$callListeners$1(TreeCache.java:811) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:807) at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:79) at org.apache.curator.framework.recipes.cache.TreeCache$2.run(TreeCache.java:909) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.IllegalStateException: Client is not started at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:507) at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:139) at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:649) at org.apache.curator.framework.imps.WatcherRemovalFacade.getZooKeeper(WatcherRemovalFacade.java:146) at 
org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1223) at org.apache.curator.framework.imps.CreateBuilderImpl$18.call(CreateBuilderImpl.java:1193) at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93) at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1190) at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:605) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:595) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:573) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:48) at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54) at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225) at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237) at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.acquireLock(ZookeeperRegistry.java:218) ... 
29 common frames omitted [ERROR] 2022-10-18 05:07:44.606 +0000 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[115] - [WorkflowInstance-0][TaskInstance-0] - MASTER server failover failed, host:172.16.18.125:5678 java.lang.NullPointerException: null at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.releaseLock(ZookeeperRegistry.java:236) at org.apache.dolphinscheduler.service.registry.RegistryClient.releaseLock(RegistryClient.java:221) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService.failoverMaster(MasterFailoverService.java:117) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService$$FastClassBySpringCGLIB$$479c980c.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) at org.apache.dolphinscheduler.server.master.service.MasterFailoverService$$EnhancerBySpringCGLIB$$f5fc50f2.failoverMaster(<generated>) at org.apache.dolphinscheduler.server.master.service.FailoverService.failoverServerWhenDown(FailoverService.java:54) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.removeMasterNodePath(MasterRegistryClient.java:112) at 
org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.handleMasterEvent(MasterRegistryDataListener.java:66) at org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener.notify(MasterRegistryDataListener.java:52) at org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperRegistry.lambda$subscribe$1(ZookeeperRegistry.java:142) at org.apache.curator.framework.recipes.cache.TreeCache.lambda$callListeners$1(TreeCache.java:811) at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:92) at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:89) at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:89) at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:807) at org.apache.curator.framework.recipes.cache.TreeCache.access$1900(TreeCache.java:79) at org.apache.curator.framework.recipes.cache.TreeCache$2.run(TreeCache.java:909) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) [INFO] 2022-10-18 05:07:44.708 +0000 org.apache.zookeeper.ClientCnxn:[568] - [WorkflowInstance-0][TaskInstance-0] - EventThread shut down for session: 0x10001031c9e005d [INFO] 2022-10-18 05:07:44.708 +0000 org.apache.zookeeper.ZooKeeper:[1232] - [WorkflowInstance-0][TaskInstance-0] - Session: 0x10001031c9e005d closed [INFO] 2022-10-18 05:07:44.708 +0000 org.apache.dolphinscheduler.server.master.rpc.MasterRPCServer:[114] - [WorkflowInstance-0][TaskInstance-0] - Closing Master RPC Server... 
[INFO] 2022-10-18 05:07:44.708 +0000 org.apache.dolphinscheduler.remote.NettyRemotingServer:[212] - [WorkflowInstance-0][TaskInstance-0] - netty server closed [INFO] 2022-10-18 05:07:44.708 +0000 org.apache.dolphinscheduler.server.master.rpc.MasterRPCServer:[116] - [WorkflowInstance-0][TaskInstance-0] - Closed Master RPC Server... [WARN] 2022-10-18 05:07:44.709 +0000 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[123] - [WorkflowInstance-0][TaskInstance-0] - State event loop service interrupted, will stop this loop java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService$StateEventResponseWorker.run(StateEventResponseService.java:118) [INFO] 2022-10-18 05:07:44.709 +0000 org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[130] - [WorkflowInstance-0][TaskInstance-0] - State event loop service stopped [INFO] 2022-10-18 05:07:44.711 +0000 org.apache.dolphinscheduler.server.master.processor.queue.TaskEventService:[125] - [WorkflowInstance-0][TaskInstance-0] - StateEventResponseWorker stopped [INFO] 2022-10-18 05:07:44.745 +0000 com.zaxxer.hikari.HikariDataSource:[350] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown initiated... [INFO] 2022-10-18 05:07:44.750 +0000 com.zaxxer.hikari.HikariDataSource:[352] - [WorkflowInstance-0][TaskInstance-0] - DolphinScheduler - Shutdown completed.` ### What you expected to happen The master and worker should work fine. ### How to reproduce Please refer to the log above. ### Anything else _No response_ ### Version 3.1.x ### Are you willing to submit PR? 
- [ ] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
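
For triage, the two stack traces in the log appear to fit a common pattern: `acquireLock` throws (here caused by `IllegalStateException: Client is not started`) before any lock is recorded, and the following `releaseLock` call then dereferences a lock that was never stored, which would explain the `NullPointerException`. Below is a minimal, self-contained Java sketch of that pattern. It is not the actual `ZookeeperRegistry` code: `FakeMutex`, `SketchRegistry`, `failover`, and the lock key are hypothetical names, and the assumption that the release runs in a `finally` block is inferred from the log only.

```java
import java.util.HashMap;
import java.util.Map;

public class LockReleaseSketch {

    /** Hypothetical stand-in for a distributed mutex; not a Curator class. */
    static final class FakeMutex {
        private final boolean clientStarted;

        FakeMutex(boolean clientStarted) {
            this.clientStarted = clientStarted;
        }

        void acquire() {
            // Mirrors the "Client is not started" cause seen in the log.
            if (!clientStarted) {
                throw new IllegalStateException("Client is not started");
            }
        }

        void release() {
            // Nothing to do in this sketch.
        }
    }

    /** Hypothetical registry that records held locks by key. */
    static final class SketchRegistry {
        private final Map<String, FakeMutex> heldLocks = new HashMap<>();
        private final boolean clientStarted;

        SketchRegistry(boolean clientStarted) {
            this.clientStarted = clientStarted;
        }

        void acquireLock(String key) {
            FakeMutex mutex = new FakeMutex(clientStarted);
            mutex.acquire();           // may throw before the lock is recorded
            heldLocks.put(key, mutex);
        }

        /** Unguarded release: throws NullPointerException if the key was never stored. */
        void releaseLockUnguarded(String key) {
            heldLocks.get(key).release();
            heldLocks.remove(key);
        }

        /** Guarded release: returns false instead of throwing when nothing is held. */
        boolean releaseLockGuarded(String key) {
            FakeMutex mutex = heldLocks.remove(key);
            if (mutex == null) {
                return false;
            }
            mutex.release();
            return true;
        }
    }

    /** Assumed shape of the failover path: acquire, work, release in finally. */
    static String failover(SketchRegistry registry, String key, boolean guarded) {
        try {
            registry.acquireLock(key);
            return "failover done";
        } catch (IllegalStateException e) {
            return "acquire failed: " + e.getMessage();
        } finally {
            if (guarded) {
                registry.releaseLockGuarded(key);
            } else {
                registry.releaseLockUnguarded(key); // an NPE here replaces the return value
            }
        }
    }

    public static void main(String[] args) {
        String key = "/lock/failover/master"; // hypothetical lock path
        SketchRegistry stopped = new SketchRegistry(false);

        System.out.println(failover(stopped, key, true));
        try {
            failover(stopped, key, false);
        } catch (NullPointerException e) {
            System.out.println("unguarded release threw NullPointerException");
        }
    }
}
```

In the sketch, the guarded variant reports only the original acquire failure, while the unguarded variant lets a second exception escape from the `finally` block. Whether a null check in `releaseLock` is the right fix for DolphinScheduler, or whether the failover should be skipped once the registry client has been closed, is a question for the maintainers.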
