KEN-LJQ opened a new issue, #13283: URL: https://github.com/apache/dolphinscheduler/issues/13283
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

We have a long-running MapReduce task that takes several hours, for example 12 hours. After a rolling restart of the workers and masters, this MR task was resubmitted to YARN every 10 minutes, and the error log showed that the old instance could not be killed. In the end, the repeated submissions created so many MR instances that our YARN cluster ran out of resources.

Our version is 2.0.7.

After digging into the code, we found the following worker failover logic in `MasterRegistryClient`:

```
private boolean checkTaskAfterWorkerStart(List<Server> workerServers, TaskInstance taskInstance) {
    if (StringUtils.isEmpty(taskInstance.getHost())) {
        return false;
    }
    Date taskTime = taskInstance.getStartTime() == null
            ? taskInstance.getSubmitTime()
            : taskInstance.getStartTime();
    Date workerServerStartDate = getServerStartupTime(workerServers, taskInstance.getHost());
    if (workerServerStartDate != null) {
        return taskTime.after(workerServerStartDate);
    }
    return false;
}
```

The failover code uses the `start_time` of the task instance instead of `submit_time`. `start_time` is not updated on each `NEED_FAULT_TOLERANCE`, whereas `submit_time` is.
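
To make the difference concrete, here is a minimal sketch of the two checks side by side. It is not the DolphinScheduler code: the class and method names are hypothetical, the null guard on `taskTime` is an addition, and it assumes (as described above) that `start_time` keeps its original value across a failover while `submit_time` is refreshed.

```java
import java.util.Date;

/** Hypothetical, simplified model of the two checks discussed in this issue. */
final class FailoverCheckSketch {

    // Modeled on the 2.0.7 snippet above: prefers startTime, falls back to submitTime.
    // The null guard on taskTime is an addition for this sketch.
    static boolean startedAfterWorkerStart(Date startTime, Date submitTime, Date workerStart) {
        if (workerStart == null) {
            return false;
        }
        Date taskTime = startTime == null ? submitTime : startTime;
        return taskTime != null && taskTime.after(workerStart);
    }

    // Modeled on the 3.x approach: compares submitTime only.
    static boolean submittedAfterWorkerStart(Date submitTime, Date workerStart) {
        return submitTime != null && workerStart != null && submitTime.after(workerStart);
    }

    public static void main(String[] args) {
        Date startTime = new Date(0L);         // task first started before the restart
        Date workerStart = new Date(60_000L);  // worker restarted one minute later
        Date submitTime = new Date(120_000L);  // task resubmitted after the failover

        // false: the task still looks older than the worker, so it is failed over again
        System.out.println(startedAfterWorkerStart(startTime, submitTime, workerStart));
        // true: the resubmitted task is recognized as new and left alone
        System.out.println(submittedAfterWorkerStart(submitTime, workerStart));
    }
}
```

With these made-up timestamps, the `start_time`-based check keeps reporting that the task predates the worker restart, which matches the repeated failover we observed.
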
As a result, the task is submitted again and again, because this MR task does not finish within the failover check interval, which is 10 minutes.

And here is the code in version 3.x:

```
// The worker is active, may already send some new task to it
if (taskInstance.getSubmitTime() != null
        && taskInstance.getSubmitTime().after(needFailoverWorkerStartTime.get())) {
    LOGGER.info(
            "The taskInstance's submitTime: {} is after the need failover worker's start time: {}, the taskInstance is newly submit, it doesn't need to failover",
            taskInstance.getSubmitTime(),
            needFailoverWorkerStartTime.get());
    return false;
}
```

### What you expected to happen

Worker failover should use `submit_time` to make this decision.

We would also like to know: if there is a Hive SQL task or a shell task and a worker failover happens, will that task be run repeatedly? We ask because we could not find any code that kills the Hive SQL. If there is none, could this be a problem?

### How to reproduce

1. Create a process with an MR task that runs longer than the failover check interval.
2. Submit a process instance.
3. Restart the worker.
4. Check how many MR tasks were submitted to YARN.

### Anything else

_No response_

### Version

2.0.x

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
