github-actions[bot] commented on issue #17317:
URL: 
https://github.com/apache/dolphinscheduler/issues/17317#issuecomment-3031289079

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   The latest version was deployed under the dolphinscheduler user.
   I created a Spark task submitted in Yarn Cluster mode, selected the op tenant to run it, and then killed the task from the UI.
   The task process on the Worker machine was killed, but the associated Yarn application was not.
   
   ### What you expected to happen
   
   After killing the workflow in the UI, a Yarn-type task should have its Yarn application killed in addition to the Worker's local process.
   
   
   1. Code file location: 
dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/AbstractCommandExecutor.java
   The buggy method is:
   ```java
   public void cancelApplication() throws InterruptedException {
       if (process == null) {
           return;
       }

       // Try to kill process tree
       boolean killed = ProcessUtils.kill(taskRequest);
       if (killed) {
           log.info("Process tree for task: {} is killed or already finished, pid: {}",
                   taskRequest.getTaskAppId(), taskRequest.getProcessId());
       } else {
           log.error("Failed to kill process tree for task: {}, pid: {}",
                   taskRequest.getTaskAppId(), taskRequest.getProcessId());
       }
   }
   ```
   The cancelApplication method only kills the Worker's local process tree; the logic to kill the Yarn application is missing.
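   A minimal, self-contained sketch of the missing step (the class and helper names here are hypothetical, not the plugin's actual API): after killing the local process tree, parse the application id from the task log and build the corresponding `yarn application -kill` command.

   ```java
   import java.util.Optional;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   public class YarnKillSketch {

       // Yarn application ids look like application_<clusterTimestamp>_<sequence>
       private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");

       /** Extract the last Yarn application id mentioned in the task log, if any. */
       public static Optional<String> findAppId(String logContent) {
           Matcher m = APP_ID.matcher(logContent);
           String last = null;
           while (m.find()) {
               last = m.group();
           }
           return Optional.ofNullable(last);
       }

       /** Build the kill command that should run in addition to killing the local process tree. */
       public static String buildKillCommand(String appId) {
           return "yarn application -kill " + appId;
       }

       public static void main(String[] args) {
           String log = "Find appId: application_1749462877863_1796 from .../119.log";
           findAppId(log).ifPresent(id -> System.out.println(buildKillCommand(id)));
       }
   }
   ```

   In the real plugin this would hook into cancelApplication after the ProcessUtils.kill call, delegating to the existing YarnApplicationManager rather than shelling out directly.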
   
   
   2. Code file location: 
dolphinscheduler-task-plugin/dolphinscheduler-task-api/src/main/java/org/apache/dolphinscheduler/plugin/task/api/am/YarnApplicationManager.java
   The buggy method is:
   ```java
   @Override
   public boolean killApplication(YarnApplicationManagerContext yarnApplicationManagerContext) throws TaskException {
       String executePath = yarnApplicationManagerContext.getExecutePath();
       String tenantCode = yarnApplicationManagerContext.getTenantCode();
       List<String> appIds = yarnApplicationManagerContext.getAppIds();

       try {
           String commandFile = String.format("%s/%s.kill", executePath, String.join(Constants.UNDERLINE, appIds));
           String cmd = getKerberosInitCommand() + "yarn application -kill " + String.join(Constants.SPACE, appIds);
           execYarnKillCommand(tenantCode, commandFile, cmd);
       } catch (Exception e) {
           log.error("Kill yarn application [{}] failed", appIds, e);
           throw new TaskException(e.getMessage());
       }

       return true;
   }
   ```
   The execYarnKillCommand method itself works, but even after the Yarn application has been killed successfully, AbstractShell.ExitCodeException is still thrown; this needs special compatibility handling at the lowest cost.
   
   
   
   3. Code file location: 
dolphinscheduler-common/src/main/java/org/apache/dolphinscheduler/common/utils/OSUtils.java
   The buggy method is:
   ```java
   public static String getSudoCmd(String tenantCode, String command) {
       if (!isSudoEnable() || StringUtils.isEmpty(tenantCode)) {
           return command;
       }
       return String.format("sudo -u %s %s", tenantCode, command);
   }
   ```
   When assembling the command, the `-i` option needs to be added. For example, executing `sudo -u dolphinscheduler yarn application -kill application_1749462877863_1818` fails with `yarn: command not found`, because without a login shell the tenant's PATH is never loaded.
   
   ### How to reproduce
   
   1. The Spark task script is as follows:
   ```shell
   ${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=8G /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
   ```
   
   
   2. Kill the workflow from the UI. After a while, the task instance and workflow instance are shown as killed, but the Yarn application kill is never triggered:
   ```
   2025-07-02 11:52:17.996 INFO [exclusive-task-executor-container-worker-0] - Final Shell file is:
   *************************** Script Content *********************************************************
   #!/bin/bash
   BASEDIR=$(cd `dirname $0`; pwd)
   cd $BASEDIR
   source /usr/local/dolphinscheduler/bin/env/dolphinscheduler_env.sh
   kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/[email protected]
   ${SPARK_HOME}/bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=2G --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=8G /data01/dolphinscheduler/exec/process/119/suyc/spark-examples_2.12-3.2.2.jar 10000000
   *************************** Script Content *********************************************************

   Executing shell command : sudo -u op -i /data01/dolphinscheduler/exec/process/119/119.sh
   process start, process id is: 171245
   ......
   ......
   Begin killing task instance, processId: 171245
   prepare to parse pid, raw pid string: sudo(171245)---119.sh(171260)---java(171337)-+-{java}(171511)
   process has exited. execute path:/data01/dolphinscheduler/exec/process/119, processId:171245,exitStatusCode:143,processWaitForStatus:true,processExitValue:143
   Start finding appId in /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log, fetch way: log
   Find appId: application_1749462877863_1796 from /data01/dolphinscheduler/20250702/145403649079392/5/73/119.log
   ```
   
   
   3. However, running `yarn application -list` on the Worker machine shows that application_1749462877863_1796 still exists:
   ```
   # yarn application -list
   ......
   application_1749462877863_1796 org.apache.spark.examples.SparkPi SPARK hdfs default RUNNING UNDEFINED 10% http://nm-bigdata-168030014.ctc.local:27865
   ```
   
   
   4. After fixing the bug above so that Kill Yarn Application is triggered, the following error appears on the next run:
   ```
   kill cmd:sudo -u dolphinscheduler sh /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill
   Kill yarn application [[application_1749462877863_1818]] failed
   org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: /data01/dolphinscheduler/exec/process/121/application_1749462877863_1818.kill: line 10: yarn: command not found
        at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:205)
        at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
        at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
        at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
        at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
        at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:342)
   ```
   Running `yarn application -list` on the Worker machine shows that application_1749462877863_1818 still exists.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.3.0-alpha
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
