njnu-seafish opened a new issue, #17316:
URL: https://github.com/apache/dolphinscheduler/issues/17316

   ### Search before asking
   
   - [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
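   I deployed the latest version as the `dolphinscheduler` user.
   I created a Shell task, selected the `default` tenant, submitted it to run, and then killed the task from the UI.
   The task process on the Worker machine was not actually killed, yet the task instance and the workflow instance were both shown as successfully killed.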
   
   ### What you expected to happen
   
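   After a workflow is killed from the UI, if the task instance is shown as killed, the related task processes on the Worker machine should actually have been killed.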
   
   
   Code file location: dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java
   The method containing the bug is:
   `
   public static boolean kill(@NonNull TaskExecutionContext request) {
       try {
           log.info("Begin killing task instance, processId: {}", request.getProcessId());
           int processId = request.getProcessId();
           if (processId == 0) {
               log.info("Task instance has already finished, no need to kill");
               return true;
           }

           // Get all child processes
           String pids = getPidsStr(processId);
           String[] pidArray = pids.split("\\s+");
           if (pidArray.length == 0) {
               log.warn("No valid PIDs found for process: {}", processId);
               return true;
           }

           // 1. Try to terminate gracefully (SIGINT)
           boolean gracefulKillSuccess = sendKillSignal("SIGINT", pids, request.getTenantCode());
           if (gracefulKillSuccess) {
               log.info("Successfully killed process tree using SIGINT, processId: {}", processId);
               return true;
           }

           // 2. Try to terminate forcefully (SIGTERM)
           boolean termKillSuccess = sendKillSignal("SIGTERM", pids, request.getTenantCode());
           if (termKillSuccess) {
               log.info("Successfully killed process tree using SIGTERM, processId: {}", processId);
               return true;
           }

           // 3. As a last resort, use `kill -9`
           log.warn("SIGINT & SIGTERM failed, using SIGKILL as a last resort for processId: {}", processId);
           boolean forceKillSuccess = sendKillSignal("SIGKILL", pids, request.getTenantCode());
           if (forceKillSuccess) {
               log.info("Successfully sent SIGKILL signal to process tree, processId: {}", processId);
           } else {
               log.error("Error sending SIGKILL signal to process tree, processId: {}", processId);
           }
           return forceKillSuccess;

       } catch (Exception e) {
           log.error("Kill task instance error, processId: {}", request.getProcessId(), e);
           return false;
       }
   }
   `
   
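   The `sendKillSignal` method only sends the kill signal; it does not guarantee that the operating system actually terminated the process. A check for whether the process is still alive needs to be added after the signal is sent.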
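
One way such a liveness check could look is sketched below. It is only an illustration, not DolphinScheduler code: the class `ProcessLivenessCheck` and its methods are hypothetical names, and it assumes Java 9+ so that `ProcessHandle` is available. It polls the PIDs until they have all exited or a timeout elapses.

```java
import java.time.Duration;

// Hypothetical helper, not part of DolphinScheduler.
final class ProcessLivenessCheck {

    /** Returns true when the OS reports no live process for this PID. */
    static boolean isTerminated(long pid) {
        // ProcessHandle.of returns an empty Optional when no such process exists.
        return ProcessHandle.of(pid).map(handle -> !handle.isAlive()).orElse(true);
    }

    /**
     * Polls until every PID has exited or the timeout elapses.
     * Returns true only if all processes are gone.
     */
    static boolean awaitTermination(long[] pids, Duration timeout, Duration pollInterval) {
        long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            boolean allGone = true;
            for (long pid : pids) {
                if (!isTerminated(pid)) {
                    allGone = false;
                    break;
                }
            }
            if (allGone) {
                return true;
            }
            if (System.nanoTime() - deadlineNanos >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and report "not confirmed dead".
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Under this assumption, `kill` would treat a signal as successful only when `sendKillSignal(...)` returned true and `awaitTermination(...)` then confirmed the processes were gone; otherwise it would escalate to the next signal.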
   
   ### How to reproduce
   
   1. The shell task script is as follows:
   `
   echo ${JAVE_HOME};
   sleep 10m
   `
   
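   2. Kill the workflow from the UI. After a short while, both the task instance and the workflow instance are shown as killed. The task instance log is as follows: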
   `
   Executing shell command : sudo -u dolphinscheduler -i /data01/dolphinscheduler/exec/process/87/87.sh
   process start, process id is: 3502853
   ......
   Begin killing task instance, processId: 3502853
   prepare to parse pid, raw pid string: sudo(3502853)---87.sh(3502868)---sleep(3502945)
   Sending SIGINT to process group: 3502853 3502868 3502945, command: sudo -u dolphinscheduler kill -s SIGINT 3502853 3502868 3502945
   Successfully killed process tree using SIGINT, processId: 3502853
   Process tree for task: 87 is killed or already finished, pid: 3502853
   `
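
The log shows a raw `pstree`-style string being turned into the PID list `3502853 3502868 3502945`. The sketch below shows how that extraction might be done with a regular expression. It is an illustration only: `PstreePidParser` is a hypothetical name, and this is not necessarily how `getPidsStr` is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DolphinScheduler.
final class PstreePidParser {

    // Matches a run of digits enclosed in parentheses, e.g. "(3502853)".
    private static final Pattern PID_IN_PARENS = Pattern.compile("\\((\\d+)\\)");

    /** Extracts PIDs in order of appearance from a line such as "sudo(1)---a.sh(2)". */
    static List<Long> parsePids(String rawPstreeLine) {
        List<Long> pids = new ArrayList<>();
        Matcher matcher = PID_IN_PARENS.matcher(rawPstreeLine);
        while (matcher.find()) {
            pids.add(Long.parseLong(matcher.group(1)));
        }
        return pids;
    }
}
```

A process whose name itself contains digits in parentheses would produce extra matches, so this simple pattern is only safe under the assumption that names look like the ones in the log above.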
   
   3. However, running `pstree` on the Worker machine shows that the task's processes still exist and were not killed:
   `
   $ pstree -p 3502853
   sudo(3502853)───87.sh(3502868)───sleep(3502945)
   `
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.3.0-alpha
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   

