KEN-LJQ opened a new issue, #13283:
URL: https://github.com/apache/dolphinscheduler/issues/13283

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   We have a long-running MapReduce task that takes several hours, for example
   12 hours. After a rolling restart of the workers and masters, this MR task
   was resubmitted to YARN every 10 minutes, and the old instance could not be
   killed, as we saw in the error log.
   Eventually the repeated submissions created so many MR instances that our
   YARN cluster ran out of resources.
   
   Our version is 2.0.7.
   After digging into the code, we found the following worker failover logic
   in `MasterRegistryClient`:
   ```
   private boolean checkTaskAfterWorkerStart(List<Server> workerServers, TaskInstance taskInstance) {
       if (StringUtils.isEmpty(taskInstance.getHost())) {
           return false;
       }

       Date taskTime = taskInstance.getStartTime() == null ? taskInstance.getSubmitTime() : taskInstance.getStartTime();

       Date workerServerStartDate = getServerStartupTime(workerServers, taskInstance.getHost());
       if (workerServerStartDate != null) {
           return taskTime.after(workerServerStartDate);
       }
       return false;
   }
   ```
   The failover code uses the task instance's `start_time` (falling back to
   `submit_time` only when `start_time` is null) instead of `submit_time`.
   `start_time` is not updated on each `NEED_FAULT_TOLERANCE`; only
   `submit_time` is updated. As a result the task is submitted again and again,
   because this MR task cannot finish within the failover check interval, which
   is 10 minutes.
   
   For comparison, here is the corresponding code in version 3.x:
   ```
   // The worker is active, may already send some new task to it
   if (taskInstance.getSubmitTime() != null && taskInstance.getSubmitTime()
       .after(needFailoverWorkerStartTime.get())) {
       LOGGER.info(
           "The taskInstance's submitTime: {} is after the need failover worker's start time: {}, the taskInstance is newly submit, it doesn't need to failover",
           taskInstance.getSubmitTime(),
           needFailoverWorkerStartTime.get());
       return false;
   }
   ```
   
   ### What you expected to happen
   
   Worker failover should use `submit_time` to decide whether a task needs
   failover.
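   
   To illustrate the difference, here is a minimal, self-contained sketch that
   models the behaviour described above. It is not DolphinScheduler code:
   `FailoverCheckSketch`, `TaskTimes`, and `countResubmissions` are hypothetical
   names, and the model assumes that a failover resubmission refreshes
   `submit_time` while leaving `start_time` unchanged, as we observed.
   
   ```java
   import java.time.Duration;
   import java.time.Instant;

   // Simplified model of the failover check; none of these types are DolphinScheduler classes.
   final class FailoverCheckSketch {

       // Hypothetical stand-in for the two timestamps of a task instance.
       static final class TaskTimes {
           Instant startTime;   // assumed to stay unchanged across failover resubmissions
           Instant submitTime;  // assumed to be refreshed on every resubmission

           TaskTimes(Instant startTime, Instant submitTime) {
               this.startTime = startTime;
               this.submitTime = submitTime;
           }
       }

       // 2.0.7-style check: prefers start_time and falls back to submit_time.
       // (The null guard is added here for safety; the quoted method has none.)
       static boolean submittedAfterWorkerStartOld(TaskTimes t, Instant workerStart) {
           Instant taskTime = t.startTime == null ? t.submitTime : t.startTime;
           return taskTime != null && taskTime.isAfter(workerStart);
       }

       // Proposed check: compare submit_time only, in the spirit of the 3.x snippet.
       static boolean submittedAfterWorkerStartNew(TaskTimes t, Instant workerStart) {
           return t.submitTime != null && t.submitTime.isAfter(workerStart);
       }

       // Counts how often a long-running task is resubmitted over `rounds` failover checks.
       static int countResubmissions(boolean useStartTime, int rounds) {
           Instant firstRun = Instant.parse("2022-12-01T00:00:00Z");
           Instant workerStart = firstRun.plus(Duration.ofHours(1)); // worker restarted mid-run
           TaskTimes task = new TaskTimes(firstRun, firstRun);
           int resubmissions = 0;
           for (int i = 1; i <= rounds; i++) {
               Instant now = workerStart.plus(Duration.ofMinutes(10L * i)); // 10 min check interval
               boolean skip = useStartTime
                       ? submittedAfterWorkerStartOld(task, workerStart)
                       : submittedAfterWorkerStartNew(task, workerStart);
               if (!skip) {
                   task.submitTime = now; // failover resubmits; start_time is left as it was
                   resubmissions++;
               }
           }
           return resubmissions;
       }

       public static void main(String[] args) {
           System.out.println("start_time check:  " + countResubmissions(true, 6));
           System.out.println("submit_time check: " + countResubmissions(false, 6));
       }
   }
   ```
   
   Under these assumptions, six check rounds resubmit the task six times with
   the `start_time` check but only once with the `submit_time` check.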
   
   We would also like to know: if a Hive SQL task or a shell task is running
   when worker failover happens, will that task be run again? We ask because
   we could not find any code that kills the Hive SQL. If there is none, could
   this be a problem?
   
   ### How to reproduce
   
   1. Create a process with an MR task that runs longer than the failover
      check interval.
   2. Submit a process instance.
   3. Restart the worker.
   4. Check how many MR tasks have been submitted to YARN.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   2.0.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
