njnu-seafish opened a new issue, #17317:
URL: https://github.com/apache/dolphinscheduler/issues/17317

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
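   I deployed the latest version as the dolphinscheduler user.
   I created a Spark task submitted in Yarn cluster mode, ran it under the op tenant, and then killed the task from the UI.
   The task process on the Worker machine was killed, but the associated Yarn application was not.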
   
   ### What you expected to happen
   
   After a workflow is killed from the UI, a Yarn-type task should not only have its local Worker process killed; the Yarn application should be killed as well.
   
   
   
1. Code file location: dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/AbstractCommandExecutor.java
   The buggy method is:
   `
   public void cancelApplication() throws InterruptedException {
           if (process == null) {
               return;
           }
   
           // Try to kill process tree
           boolean killed = ProcessUtils.kill(taskRequest);
           if (killed) {
               log.info("Process tree for task: {} is killed or already finished, pid: {}",
                       taskRequest.getTaskAppId(), taskRequest.getProcessId());
           } else {
               log.error("Failed to kill process tree for task: {}, pid: {}",
                       taskRequest.getTaskAppId(), taskRequest.getProcessId());
           }
       }
   `
   cancelApplication only triggers the kill of the local process on the Worker machine; the logic to kill the Yarn application is missing.
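
   The missing step can be sketched roughly as below. This is only an illustration of the intended flow, not the actual patch: `LocalProcessKiller` and `YarnKiller` are hypothetical stand-ins for `ProcessUtils.kill` and a Yarn kill such as `YarnApplicationManager.killApplication`, and the way the appIds are obtained is assumed to happen elsewhere (for example by parsing the task log, as the Worker log in this issue does).

```java
import java.util.List;

final class CancelFlowSketch {

    /** Hypothetical stand-in for the local process-tree kill (ProcessUtils.kill in the issue). */
    interface LocalProcessKiller {
        boolean kill();
    }

    /** Hypothetical stand-in for a Yarn kill such as YarnApplicationManager.killApplication. */
    interface YarnKiller {
        void kill(List<String> appIds) throws Exception;
    }

    /**
     * Kills the local process first, then the Yarn applications (if any).
     * Returns true only when both steps succeeded.
     */
    static boolean cancel(LocalProcessKiller local, YarnKiller yarn, List<String> appIds) {
        boolean localKilled = local.kill();

        boolean yarnKilled = true;
        if (appIds != null && !appIds.isEmpty()) {
            try {
                yarn.kill(appIds);
            } catch (Exception e) {
                // A failed Yarn kill is reported to the caller instead of being silently dropped.
                yarnKilled = false;
            }
        }
        return localKilled && yarnKilled;
    }

    public static void main(String[] args) {
        boolean ok = cancel(
                () -> true,
                ids -> System.out.println("would run: yarn application -kill " + String.join(" ", ids)),
                List.of("application_1749462877863_1796"));
        System.out.println("cancel succeeded: " + ok);
    }
}
```

   The point of the sketch is only the ordering: the Yarn kill runs whether or not the local process was still alive, because in cluster mode the driver keeps running on Yarn after the local spark-submit process is gone.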
   
   
   
2. Code file location: dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/am/YarnApplicationManager.java
   The buggy method is:
   `
       @Override
       public boolean killApplication(YarnApplicationManagerContext yarnApplicationManagerContext) throws TaskException {
           String executePath = yarnApplicationManagerContext.getExecutePath();
           String tenantCode = yarnApplicationManagerContext.getTenantCode();
           List<String> appIds = yarnApplicationManagerContext.getAppIds();
   
           try {
               String commandFile = String.format("%s/%s.kill", executePath, String.join(Constants.UNDERLINE, appIds));
               String cmd = getKerberosInitCommand() + "yarn application -kill " + String.join(Constants.SPACE, appIds);
               execYarnKillCommand(tenantCode, commandFile, cmd);
           } catch (Exception e) {
               log.error("Kill yarn application [{}] failed", appIds, e);
               throw new TaskException(e.getMessage());
           }
   
           return true;
       }
   `
   Even after execYarnKillCommand kills the Yarn application successfully, it still throws AbstractShell.ExitCodeException; this needs special-case handling at minimal cost.
   
   
   
   
3. Code file location: dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/OSUtils.java
   The buggy method is:
   `
       public static String getSudoCmd(String tenantCode, String command) {
           if (!isSudoEnable() || StringUtils.isEmpty(tenantCode)) {
               return command;
           }
           return String.format("sudo -u %s %s", tenantCode, command);
       }
   `
   getSudoCmd needs to add the -i option when it assembles the command. For example, running sudo -u dolphinscheduler yarn application -kill application_1749462877863_1818 fails with: yarn: command not found
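
   A minimal sketch of the proposed change to getSudoCmd follows. It is not the actual patch: to keep it self-contained, the `isSudoEnable()` check is replaced by a hypothetical `sudoEnabled` parameter and `StringUtils.isEmpty` by a plain null/empty check. With `-i`, sudo runs the command through the target user's login shell, so `yarn` is found only if that user's login profile puts it on the PATH, which is assumed here.

```java
final class SudoCommandSketch {

    /**
     * Builds the command to run as the given tenant.
     * sudoEnabled is a hypothetical parameter standing in for OSUtils.isSudoEnable().
     */
    static String getSudoCmd(boolean sudoEnabled, String tenantCode, String command) {
        if (!sudoEnabled || tenantCode == null || tenantCode.isEmpty()) {
            return command;
        }
        // -i runs the command via the tenant's login shell, so the login profile (and its PATH) is loaded.
        return String.format("sudo -u %s -i %s", tenantCode, command);
    }

    public static void main(String[] args) {
        System.out.println(getSudoCmd(true, "dolphinscheduler",
                "yarn application -kill application_1749462877863_1818"));
    }
}
```

   This matches the form the Worker already uses when it launches the task script (sudo -u op -i ...), so the kill script would run in the same kind of environment as the task itself.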
   
   ### How to reproduce
   
   1. The Spark task script is as follows:
   `
   ${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=8G /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
   `
   
   
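   2. Kill the workflow from the UI. After a while, both the task instance and the workflow instance are shown as killed, but the kill of the Yarn application is never triggered: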
   `
   2025-07-02 11:52:17.996 INFO  [exclusive-task-executor-container-worker-0] - Final Shell file is: 
   ****************************** Script Content *****************************************************************
   #!/bin/bash
   BASEDIR=$(cd `dirname $0`; pwd)
   cd $BASEDIR
   source /usr/local/dolphinscheduler/bin/env/dolphinscheduler_env.sh
   kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/[email protected]
   ${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=8G /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
   ****************************** Script Content *****************************************************************
   
   Executing shell command : sudo -u op -i /data01/dolphinscheduler/exec/process/119/119.sh
   process start, process id is: 171245
   ......
   ......
   Begin killing task instance, processId: 171245
   prepare to parse pid, raw pid string: sudo(171245)---119.sh(171260)---java(171337)-+-{java}(171511)
   process has exited. execute path:/data01/dolphinscheduler/exec/process/119, processId:171245 ,exitStatusCode:143 ,processWaitForStatus:true ,processExitValue:143
   Start finding appId in /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log, fetch way: log 
   Find appId: application_1749462877863_1796 from /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log
   `
   
   
   3. However, running yarn application -list on the Worker machine shows that application_1749462877863_1796 still exists:
   `
   # yarn application -list
   ......
   application_1749462877863_1796   org.apache.spark.examples.SparkPi   SPARK   hdfs   default   RUNNING   UNDEFINED   10%   http://nm-bigdata-168030014.ctc.local:27865
   `
   
   
   4. After fixing the bug above, so that the Yarn application kill is triggered, running again produces the following error log:
   kill cmd:sudo -u dolphinscheduler sh /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill
   Kill yarn application [[application_1749462877863_1818]] failed
   org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill: line 10: yarn: command not found
        at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:205)
        at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
        at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
        at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
        at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
        at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:342)
   Running yarn application -list on the Worker machine shows that application_1749462877863_1818 still exists.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.3.0-alpha
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
