njnu-seafish opened a new issue, #17317: URL: https://github.com/apache/dolphinscheduler/issues/17317
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

I deployed the latest version as the `dolphinscheduler` user, created a Spark task submitted in YARN cluster mode, ran it under the `op` tenant, and then killed the task from the UI. The task process on the Worker machine was killed, but the associated YARN application was not.

### What you expected to happen

After a workflow is killed from the UI, a YARN-type task should not only have its local Worker process killed; the YARN application should be killed as well.

1. File: dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/AbstractCommandExecutor.java

The buggy method is:

```java
public void cancelApplication() throws InterruptedException {
    if (process == null) {
        return;
    }
    // Try to kill process tree
    boolean killed = ProcessUtils.kill(taskRequest);
    if (killed) {
        log.info("Process tree for task: {} is killed or already finished, pid: {}",
                taskRequest.getTaskAppId(), taskRequest.getProcessId());
    } else {
        log.error("Failed to kill process tree for task: {}, pid: {}",
                taskRequest.getTaskAppId(), taskRequest.getProcessId());
    }
}
```

`cancelApplication` only triggers the kill of the local process on the Worker machine; the logic that kills the YARN application is missing.

2. File: dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/am/YarnApplicationManager.java

The buggy method is:

```java
@Override
public boolean killApplication(YarnApplicationManagerContext yarnApplicationManagerContext) throws TaskException {
    String executePath = yarnApplicationManagerContext.getExecutePath();
    String tenantCode = yarnApplicationManagerContext.getTenantCode();
    List<String> appIds = yarnApplicationManagerContext.getAppIds();

    try {
        String commandFile = String.format("%s/%s.kill", executePath, String.join(Constants.UNDERLINE, appIds));
        String cmd = getKerberosInitCommand() + "yarn application -kill " + String.join(Constants.SPACE, appIds);
        execYarnKillCommand(tenantCode, commandFile, cmd);
    } catch (Exception e) {
        log.error("Kill yarn application [{}] failed", appIds, e);
        throw new TaskException(e.getMessage());
    }

    return true;
}
```

Even when `execYarnKillCommand` kills the YARN application successfully, it still throws `AbstractShell.ExitCodeException`. This case needs special handling, with as small a change as possible.

3. File: dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/OSUtils.java

The buggy method is:

```java
public static String getSudoCmd(String tenantCode, String command) {
    if (!isSudoEnable() || StringUtils.isEmpty(tenantCode)) {
        return command;
    }
    return String.format("sudo -u %s %s", tenantCode, command);
}
```

`getSudoCmd` needs to add the `-i` option when it assembles the command. For example, running `sudo -u dolphinscheduler yarn application -kill application_1749462877863_1818` fails with the error `yarn: command not found`.

### How to reproduce

1. The Spark task script is:

```
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=8G /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
```

2. Kill the workflow from the UI. After a while, both the task instance and the workflow instance are shown as killed, but no kill of the YARN application is triggered:

```
2025-07-02 11:52:17.996 INFO [exclusive-task-executor-container-worker-0] - Final Shell file is:
****************************** Script Content *****************************************************************
#!/bin/bash
BASEDIR=$(cd `dirname $0`; pwd)
cd $BASEDIR
source /usr/local/dolphinscheduler/bin/env/dolphinscheduler_env.sh
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/[email protected]
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=8G /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
****************************** Script Content *****************************************************************
Executing shell command : sudo -u op -i /data01/dolphinscheduler/exec/process/119/119.sh
process start, process id is: 171245
......
......
Begin killing task instance, processId: 171245
prepare to parse pid, raw pid string: sudo(171245)---119.sh(171260)---java(171337)-+-{java}(171511)
process has exited. execute path:/data01/dolphinscheduler/exec/process/119, processId:171245 ,exitStatusCode:143 ,processWaitForStatus:true ,processExitValue:143
Start finding appId in /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log, fetch way: log
Find appId: application_1749462877863_1796 from /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log
```

3. However, running `yarn application -list` on the Worker machine shows that application_1749462877863_1796 still exists:

```
# yarn application -list
......
application_1749462877863_1796 org.apache.spark.examples.SparkPi SPARK hdfs default RUNNING UNDEFINED 10% http://nm-bigdata-168030014.ctc.local:27865
```

4. After fixing the bug above (the YARN application kill not being triggered), running the task again produces the following error log:

```
kill cmd:sudo -u dolphinscheduler sh /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill
Kill yarn application [[application_1749462877863_1818]] failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill: line 10: yarn: command not found
    at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:205)
    at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:342)
```

Running `yarn application -list` on the Worker machine shows that application_1749462877863_1818 still exists.

### Anything else

_No response_

### Version

3.3.0-alpha

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

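To illustrate the `-i` change proposed for `getSudoCmd` above, here is a minimal standalone sketch. It is not the project's code and not necessarily the final fix: the class name `SudoCmdSketch` and the `sudoEnabled` flag are hypothetical stand-ins (the real method calls `isSudoEnable()` and `StringUtils.isEmpty`). The assumption is that `yarn: command not found` appears because, without `-i`, the tenant's login environment (and therefore its `PATH`) is not loaded.

```java
public class SudoCmdSketch {

    // Hypothetical stand-in for OSUtils.isSudoEnable(), which reads configuration.
    static boolean sudoEnabled = true;

    // Sketch of getSudoCmd with the proposed -i option added.
    static String getSudoCmd(String tenantCode, String command) {
        if (!sudoEnabled || tenantCode == null || tenantCode.isEmpty()) {
            return command;
        }
        // -i asks sudo to run the command through the target user's login shell,
        // so that user's profile is read before the command executes.
        return String.format("sudo -u %s -i %s", tenantCode, command);
    }

    public static void main(String[] args) {
        System.out.println(getSudoCmd("dolphinscheduler",
                "sh /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill"));
    }
}
```

With this change the kill command would have the same shape as the task launch command already shown in the log (`sudo -u op -i /data01/dolphinscheduler/exec/process/119/119.sh`). Whether a login shell is acceptable for every caller of `getSudoCmd` would still need to be checked.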
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
