github-actions[bot] commented on issue #17317: URL: https://github.com/apache/dolphinscheduler/issues/17317#issuecomment-3031289079
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

The latest version was deployed under the `dolphinscheduler` user. I created a new Spark task submitted in Yarn Cluster mode, selected the `op` tenant to submit and run it, and then killed the task from the UI. The task process on the Worker machine was killed, but the associated Yarn application was not.

### What you expected to happen

After killing the workflow in the UI, a Yarn-type task should not only have its Worker-local process killed, but should also have `yarn application -kill` executed.

1. Code file location: `dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/AbstractCommandExecutor.java`. The buggy method is:

```java
public void cancelApplication() throws InterruptedException {
    if (process == null) {
        return;
    }
    // Try to kill process tree
    boolean killed = ProcessUtils.kill(taskRequest);
    if (killed) {
        log.info("Process tree for task: {} is killed or already finished, pid: {}",
                taskRequest.getTaskAppId(), taskRequest.getProcessId());
    } else {
        log.error("Failed to kill process tree for task: {}, pid: {}",
                taskRequest.getTaskAppId(), taskRequest.getProcessId());
    }
}
```

The `cancelApplication` method only kills the Worker-local process tree; the logic to kill the Yarn application is missing.
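The missing step could look roughly like the following. This is a simplified, self-contained sketch of the idea only: `killLocalProcessTree` and `buildYarnKillCommand` are illustrative stand-ins for `ProcessUtils.kill(taskRequest)` and `YarnApplicationManager#killApplication`, not the real DolphinScheduler APIs.

```java
import java.util.List;

// Hedged sketch: cancelApplication should cover BOTH kill paths --
// the Worker-local process tree AND any YARN applications the task
// launched. All names here are illustrative stand-ins.
public class CancelApplicationSketch {

    // Stand-in for ProcessUtils.kill(taskRequest)
    static boolean killLocalProcessTree(int pid) {
        return true; // pretend the local kill succeeded
    }

    // Stand-in for YarnApplicationManager#killApplication: builds the
    // command the real code writes into the generated .kill script
    static String buildYarnKillCommand(List<String> appIds) {
        return "yarn application -kill " + String.join(" ", appIds);
    }

    // Sketch of a cancelApplication that also triggers the yarn kill
    static String cancelApplication(int pid, List<String> appIds) {
        boolean killed = killLocalProcessTree(pid);
        if (killed && !appIds.isEmpty()) {
            // the step the report says is missing today
            return buildYarnKillCommand(appIds);
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(
                cancelApplication(171245, List.of("application_1749462877863_1796")));
    }
}
```

The point of the sketch is only the control flow: the yarn kill must be triggered from the same cancellation path that kills the local process, using the appIds already parsed from the task log.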
2. Code file location: `dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/am/YarnApplicationManager.java`. The buggy method is:

```java
@Override
public boolean killApplication(YarnApplicationManagerContext yarnApplicationManagerContext) throws TaskException {
    String executePath = yarnApplicationManagerContext.getExecutePath();
    String tenantCode = yarnApplicationManagerContext.getTenantCode();
    List<String> appIds = yarnApplicationManagerContext.getAppIds();
    try {
        String commandFile = String.format("%s/%s.kill", executePath, String.join(Constants.UNDERLINE, appIds));
        String cmd = getKerberosInitCommand() + "yarn application -kill " + String.join(Constants.SPACE, appIds);
        execYarnKillCommand(tenantCode, commandFile, cmd);
    } catch (Exception e) {
        log.error("Kill yarn application [{}] failed", appIds, e);
        throw new TaskException(e.getMessage());
    }
    return true;
}
```

The `execYarnKillCommand` method itself works, but even after the Yarn application has been killed, an `AbstractShell.ExitCodeException` is still thrown. This needs special compatibility handling at the lowest possible cost.

3. Code file location: `dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/OSUtils.java`. The buggy method is:

```java
public static String getSudoCmd(String tenantCode, String command) {
    if (!isSudoEnable() || StringUtils.isEmpty(tenantCode)) {
        return command;
    }
    return String.format("sudo -u %s %s", tenantCode, command);
}
```

When assembling the script, the `-i` option needs to be added. For example, executing `sudo -u dolphinscheduler yarn application -kill application_1749462877863_1818` fails with `yarn: command not found`.

### How to reproduce
1. The Spark task script is as follows:

```shell
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.driver.cores=1 --conf spark.driver.memory=2G \
  --conf spark.executor.instances=2 --conf spark.executor.cores=2 \
  --conf spark.executor.memory=8G \
  /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
```

2. Kill the workflow in the UI. After a while, the task instance and workflow instance are shown as killed, but the Yarn kill is never triggered:

```
2025-07-02 11:52:17.996 INFO [exclusive-task-executor-container-worker-0] - Final Shell file is:
*************************** Script Content *********************************************************
#!/bin/bash
BASEDIR=$(cd `dirname $0`; pwd)
cd $BASEDIR
source /usr/local/dolphinscheduler/bin/env/dolphinscheduler_env.sh
kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/[email protected]
${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=8G /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
*************************** Script Content *********************************************************
Executing shell command : sudo -u op -i /data01/dolphinscheduler/exec/process/119/119.sh
process start, process id is: 171245
......
......
Begin killing task instance, processId: 171245
prepare to parse pid, raw pid string: sudo(171245)---119.sh(171260)---java(171337)-+-{java}(171511)
process has exited. execute path:/data01/dolphinscheduler/exec/process/119, processId:171245, exitStatusCode:143, processWaitForStatus:true, processExitValue:143
Start finding appId in /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log, fetch way: log
Find appId: application_1749462877863_1796 from /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log
```

3. However, running `yarn application -list` on the Worker machine shows that application_1749462877863_1796 still exists:

```
# yarn application -list
......
application_1749462877863_1796  org.apache.spark.examples.SparkPi  SPARK  hdfs  default  RUNNING  UNDEFINED  10%  http://nm-bigdata-168030014.ctc.local:27865
```

4. After fixing the bug above so that Kill Yarn Application is actually triggered, running again produces the following error log:

```
kill cmd:sudo -u dolphinscheduler sh /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill
Kill yarn application [[application_1749462877863_1818]] failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill: line 10: yarn: command not found
    at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:205)
    at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:342)
```

Running `yarn application -list` on the Worker machine again shows that application_1749462877863_1818 still exists.

### Anything else

_No response_

### Version

3.3.0-alpha

### Are you willing to submit PR?

- [x] Yes, I am willing to submit a PR!
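The `yarn: command not found` failure in step 4 happens because plain `sudo -u` does not load the tenant user's login environment, so `yarn` is not on `PATH`. The `-i` fix proposed in point 3 could be sketched as follows; this is an illustrative, simplified version of `OSUtils.getSudoCmd` (the real method also checks `isSudoEnable()`, omitted here to keep the sketch self-contained), not the final patch:

```java
// Hedged sketch of the proposed OSUtils.getSudoCmd fix: add "-i" so sudo
// starts a login shell for the tenant user, sourcing its profile and PATH.
// Without -i, the generated .kill script fails with "yarn: command not found".
public class SudoCmdSketch {

    static String getSudoCmd(String tenantCode, String command) {
        // the real method additionally checks isSudoEnable()
        if (tenantCode == null || tenantCode.isEmpty()) {
            return command;
        }
        // -u: run as the tenant user; -i: simulate that user's login shell
        return String.format("sudo -u %s -i %s", tenantCode, command);
    }

    public static void main(String[] args) {
        System.out.println(getSudoCmd("dolphinscheduler",
                "yarn application -kill application_1749462877863_1818"));
    }
}
```

Note that the task-launch path in the log above already uses `sudo -u op -i ...`; the sketch only brings the kill path in line with it.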
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
