njnu-seafish opened a new issue, #17316: URL: https://github.com/apache/dolphinscheduler/issues/17316
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

I deployed the latest version as the `dolphinscheduler` user. I then created a Shell task, selected the default tenant, submitted it to run, and killed the task from the UI.

The task process on the Worker machine was not actually killed, yet both the task instance and the workflow instance were shown as successfully killed.

### What you expected to happen

After a workflow is killed from the UI, if the task instance is shown as killed, the related task processes on the Worker machine should actually have been killed.

Code location: dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/utils/ProcessUtils.java

The method containing the bug is:

```java
public static boolean kill(@NonNull TaskExecutionContext request) {
    try {
        log.info("Begin killing task instance, processId: {}", request.getProcessId());
        int processId = request.getProcessId();
        if (processId == 0) {
            log.info("Task instance has already finished, no need to kill");
            return true;
        }

        // Get all child processes
        String pids = getPidsStr(processId);
        String[] pidArray = pids.split("\\s+");
        if (pidArray.length == 0) {
            log.warn("No valid PIDs found for process: {}", processId);
            return true;
        }

        // 1. Try to terminate gracefully (SIGINT)
        boolean gracefulKillSuccess = sendKillSignal("SIGINT", pids, request.getTenantCode());
        if (gracefulKillSuccess) {
            log.info("Successfully killed process tree using SIGINT, processId: {}", processId);
            return true;
        }

        // 2. Try to terminate forcefully (SIGTERM)
        boolean termKillSuccess = sendKillSignal("SIGTERM", pids, request.getTenantCode());
        if (termKillSuccess) {
            log.info("Successfully killed process tree using SIGTERM, processId: {}", processId);
            return true;
        }

        // 3. As a last resort, use `kill -9`
        log.warn("SIGINT & SIGTERM failed, using SIGKILL as a last resort for processId: {}", processId);
        boolean forceKillSuccess = sendKillSignal("SIGKILL", pids, request.getTenantCode());
        if (forceKillSuccess) {
            log.info("Successfully sent SIGKILL signal to process tree, processId: {}", processId);
        } else {
            log.error("Error sending SIGKILL signal to process tree, processId: {}", processId);
        }
        return forceKillSuccess;
    } catch (Exception e) {
        log.error("Kill task instance error, processId: {}", request.getProcessId(), e);
        return false;
    }
}
```

`sendKillSignal` only sends the kill signal. It does not guarantee that the operating system has actually terminated the process, so `kill` reports success as soon as SIGINT is delivered, even when the process ignores it. A check of whether the process is still alive needs to be added after the signal is sent.

### How to reproduce

1. Use the following shell task script:

```shell
echo ${JAVE_HOME}; sleep 10m
```

2. Kill the workflow from the UI. After a while, both the task instance and the workflow instance are shown as killed. The task instance log is as follows:

```
Executing shell command : sudo -u dolphinscheduler -i /data01/dolphinscheduler/exec/process/87/87.sh
process start, process id is: 3502853
......
Begin killing task instance, processId: 3502853
prepare to parse pid, raw pid string: sudo(3502853)---87.sh(3502868)---sleep(3502945)
Sending SIGINT to process group: 3502853 3502868 3502945, command: sudo -u dolphinscheduler kill -s SIGINT 3502853 3502868 3502945
Successfully killed process tree using SIGINT, processId: 3502853
Process tree for task: 87 is killed or already finished, pid: 3502853
```

3. However, running `pstree` on the Worker machine shows that the task's process tree still exists and has not been killed:

```
$ pstree -p 3502853
sudo(3502853)───87.sh(3502868)───sleep(3502945)
```

### Anything else

_No response_

### Version

3.3.0-alpha

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
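The liveness check proposed under "What you expected to happen" could look roughly like the sketch below. It is not the project's actual fix. `KillVerifier`, `isAlive`, and `survivors` are hypothetical names introduced only for illustration, and the sketch assumes Java 9+ so that `ProcessHandle` is available and can see processes started under another user via `sudo`. The idea is to poll the PIDs for a bounded time after each signal and treat the kill as successful only when none of them is still alive.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Hypothetical helper, not part of DolphinScheduler.
final class KillVerifier {

    private KillVerifier() {
    }

    /** Default probe: true if the OS still reports the pid as a live process. */
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /**
     * Polls until every pid has exited or the timeout expires.
     * Returns the pids that are still alive; an empty list means the kill took effect.
     * The probe is a parameter so the polling logic can be tested without real processes.
     */
    static List<Long> survivors(List<Long> pids, LongPredicate alive,
                                Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        List<Long> remaining = new ArrayList<>(pids);
        while (true) {
            // Drop every pid that the probe no longer reports as alive.
            remaining.removeIf(pid -> !alive.test(pid));
            if (remaining.isEmpty() || deadline - System.nanoTime() <= 0) {
                return remaining;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt and report whatever is still alive.
                Thread.currentThread().interrupt();
                return remaining;
            }
        }
    }
}
```

Under these assumptions, `kill` would return `true` after a signal only when something like `KillVerifier.survivors(pidList, KillVerifier::isAlive, Duration.ofSeconds(5), Duration.ofMillis(200))` comes back empty, and would otherwise escalate from SIGINT to SIGTERM to SIGKILL. The timeout and poll interval shown here are placeholders, not tuned values.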
