alei1206 opened a new issue, #15370:
URL: https://github.com/apache/dolphinscheduler/issues/15370

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When a worker goes offline, there is a chance that the master will continue to dispatch tasks to it, and the dispatch fails with an exception:
   
   [ERROR] 2023-12-20 22:58:25.317 +0800 org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper:[87] - [WorkflowInstance-0][TaskInstance-0] - Dispatch task failed
   org.apache.dolphinscheduler.server.master.exception.TaskDispatchException: Dispatch task to 192.168.1.128:1234 failed
           at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:101)
           at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.dispatchTask(BaseTaskDispatcher.java:74)
           at org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper.run(GlobalTaskDispatchWaitingQueueLooper.java:79)
   Caused by: org.apache.dolphinscheduler.remote.exceptions.RemotingException: connect to : Host(ip=192.168.1.128, port=1234) fail
           at org.apache.dolphinscheduler.remote.NettyRemotingClient.sendSync(NettyRemotingClient.java:210)
           at org.apache.dolphinscheduler.server.master.rpc.MasterRpcClient.sendSyncCommand(MasterRpcClient.java:49)
           at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:87)
   
   
![Snipaste_2023-12-27_19-11-47](https://github.com/apache/dolphinscheduler/assets/97011595/62ca51d9-7ab4-4f1a-aa41-ac1201a70f98)
   
   
   ### What you expected to happen
   
   When a worker goes offline, the master should clear that worker's node information and stop dispatching tasks to it.
   
   ### How to reproduce
   
   1. Start the master-1 and worker-1 servers.
   2. Start the worker-2 server. All workers are in the default worker group.
   3. Kill the worker-1 server.
   4. Run a workflow that consists of a number of tasks.
   5. Dispatch exceptions are printed to the master log.
   
   In addition, debugging the master server at `org.apache.dolphinscheduler.server.master.registry.ServerNodeManager#updateWorkerNodes()` shows that when a worker goes offline, the master does not remove the offline worker's information from `workerNodeInfo`.
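
To illustrate the expected behavior, here is a minimal sketch of a cache that drops offline workers whenever it is refreshed from a registry snapshot. It is not the actual `ServerNodeManager` code: the class `WorkerNodeCache`, the methods `syncWorkerNodes` and `onlineWorkers`, and the `String` heartbeat value are all hypothetical names chosen for this example. Only the field name `workerNodeInfo` is taken from the report above.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the master's worker cache; not DolphinScheduler code. */
final class WorkerNodeCache {

    // Worker address ("ip:port") -> heartbeat payload, analogous to workerNodeInfo in this issue.
    private final Map<String, String> workerNodeInfo = new ConcurrentHashMap<>();

    /**
     * Bring the cache in line with the latest registry snapshot.
     * Workers missing from the snapshot are treated as offline and removed.
     */
    void syncWorkerNodes(Map<String, String> registrySnapshot) {
        // Remove workers that are no longer registered (the step reported as missing).
        workerNodeInfo.keySet().retainAll(registrySnapshot.keySet());
        // Add new workers and refresh the heartbeat data of those still online.
        workerNodeInfo.putAll(registrySnapshot);
    }

    /** Read-only copy of the addresses a dispatcher may still pick from. */
    Set<String> onlineWorkers() {
        return Collections.unmodifiableSet(new HashSet<>(workerNodeInfo.keySet()));
    }
}
```

With a cache like this, a dispatcher that selects only from `onlineWorkers()` would stop choosing `192.168.1.128:1234` once that worker disappears from the registry snapshot. Whether the real fix belongs in `updateWorkerNodes()` or in the registry event listener is an open question for the PR.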
   
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.2.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

