njnu-seafish opened a new issue, #17359: URL: https://github.com/apache/dolphinscheduler/issues/17359
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

I tried to kill a Spark job running on YARN, but the kill failed. The logs showed "yarn: command not found". After I fixed that, the logs showed that killing the YARN application still failed with an ExitCodeException: the exit code is 0, but errMsg is not null.

---------------------------------------------------------------------------------------------
The first log:
---------------------------------------------------------------------------------------------

2025-07-22 10:48:27.128 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-36] - kill cmd:sudo -u dolphinscheduler sh /data01/dolphinscheduler/exec/process/147/application_1749462877863_5866.kill
2025-07-22 10:48:27.151 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-36] - Kill yarn application [[application_1749462877863_5866]] failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: /data01/dolphinscheduler/exec/process/147/application_1749462877863_5866.kill: line 10: yarn: command not found
    at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:205)
    at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:342)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(OSUtils.java:331)
    at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.execYarnKillCommand(YarnApplicationManager.java:91)
    at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:51)
    at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:38)
    at org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils.cancelApplication(ProcessUtils.java:345)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.cancelApplication(AbstractCommandExecutor.java:226)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.cancelApplication(AbstractYarnTask.java:91)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractRemoteTask.cancel(AbstractRemoteTask.java:39)
    at org.apache.dolphinscheduler.server.worker.executor.PhysicalTaskExecutor.kill(PhysicalTaskExecutor.java:102)
    at org.apache.dolphinscheduler.task.executor.listener.TaskExecutorLifecycleEventListener.onTaskExecutorKillLifecycleEvent(TaskExecutorLifecycleEventListener.java:88)
    at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.doFireTaskExecutorEventBus(TaskExecutorEventBusCoordinator.java:166)
    at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.lambda$fireTaskExecutorEventBus$1(TaskExecutorEventBusCoordinator.java:123)
    at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
    at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1646)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
2025-07-22 10:48:27.151 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-36] - Cancel application failed: /data01/dolphinscheduler/exec/process/147/application_1749462877863_5866.kill: line 10: yarn: command not found

---------------------------------------------------------------------------------------------
After fixing this, the second log:
---------------------------------------------------------------------------------------------

2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Successfully killed process tree using SIGTERM, processId: 1219746
2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Process tree for task: 150 is killed or already finished, pid: 1219746
2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Get appIds from worker xxxxx:1234, taskLogPath: /data01/dolphinscheduler/20250722/145403649079392/5/103/150.log
2025-07-22 14:45:15.928 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Start finding appId in /data01/dolphinscheduler/20250722/145403649079392/5/103/150.log, fetch way: log
2025-07-22 14:45:15.929 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Find appId: application_1749462877863_5903 from /data01/dolphinscheduler/20250722/145403649079392/5/103/150.log
2025-07-22 14:45:15.930 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - get kerberos init command
2025-07-22 14:45:15.930 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - kerberos init command: export KRB5_CONFIG=/etc/krb5.conf
kinit -k -t /etc/security/keytabs/hdfs.keytab hdfs/xxxxx || true
2025-07-22 14:45:15.930 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - kill cmd:sudo -u dolphinscheduler -i sh /data01/dolphinscheduler/exec/process/150/application_1749462877863_5903.kill
2025-07-22 14:45:17.398 INFO [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - exitCode: 0
2025-07-22 14:45:17.399 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Kill yarn application [[application_1749462877863_5903]] failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException: 2025-07-22 14:45:17,383 | INFO | impl.YarnClientImpl | Killed application application_1749462877863_5903
    at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:206)
    at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:118)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:125)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:103)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:86)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:343)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(OSUtils.java:332)
    at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.execYarnKillCommand(YarnApplicationManager.java:91)
    at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:51)
    at org.apache.dolphinscheduler.plugin.task.api.am.YarnApplicationManager.killApplication(YarnApplicationManager.java:38)
    at org.apache.dolphinscheduler.plugin.task.api.utils.ProcessUtils.cancelApplication(ProcessUtils.java:345)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.cancelApplication(AbstractCommandExecutor.java:226)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.cancelApplication(AbstractYarnTask.java:91)
    at org.apache.dolphinscheduler.plugin.task.api.AbstractRemoteTask.cancel(AbstractRemoteTask.java:39)
    at org.apache.dolphinscheduler.server.worker.executor.PhysicalTaskExecutor.kill(PhysicalTaskExecutor.java:102)
    at org.apache.dolphinscheduler.task.executor.listener.TaskExecutorLifecycleEventListener.onTaskExecutorKillLifecycleEvent(TaskExecutorLifecycleEventListener.java:88)
    at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.doFireTaskExecutorEventBus(TaskExecutorEventBusCoordinator.java:166)
    at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorEventBusCoordinator.lambda$fireTaskExecutorEventBus$1(TaskExecutorEventBusCoordinator.java:123)
    at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
    at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1646)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
2025-07-22 14:45:17.399 ERROR [PhysicalTaskExecutorEventBusCoordinator-eventbus-coordinator-worker-97] - Cancel application failed: 2025-07-22 14:45:17,383 | INFO | impl.YarnClientImpl | Killed application application_1749462877863_5903

### What you expected to happen

DolphinScheduler should terminate the YARN application successfully.

### How to reproduce

I don't know whether this is specific to my environment, but the kill fails consistently for me. Has anyone else encountered a similar problem?

### Anything else

---------------------------------------------------------------------------------------------
For the first question, my test is as follows.
---------------------------------------------------------------------------------------------

[root@xxxxx][~] # sudo -u dolphinscheduler yarn version
sudo: yarn: command not found
[root@xxxxx][~] # sudo -u dolphinscheduler -i yarn version
Hadoop 3.3.3
Source code repository Unknown -r Unknown
Compiled by root on 2023-07-31T01:58Z
Compiled with protoc 3.7.1
From source with checksum 9437955990f3957351278654266784fc
This command was run using /usr/local/hadoop-3.3.3_ccdp3.3.3_1.0.2/share/hadoop/common/hadoop-common-3.3.3.jar
[root@xxxxx][~] # su - dolphinscheduler
[dolphinscheduler@xxxxx][~] $ yarn version
Hadoop 3.3.3
Source code repository Unknown -r Unknown
Compiled by root on 2023-07-31T01:58Z
Compiled with protoc 3.7.1
From source with checksum 9437955990f3957351278654266784fc
This command was run using /usr/local/hadoop-3.3.3_ccdp3.3.3_1.0.2/share/hadoop/common/hadoop-common-3.3.3.jar

---------------------------------------------------------------------------------------------
For the second question, my test is as follows.
---------------------------------------------------------------------------------------------

[root@xxxxx][/usr/local/dolphinscheduler] # yarn application -kill application_1749462877863_5866
Killing application application_1749462877863_5866
2025-07-22 14:03:59,361 | INFO | impl.YarnClientImpl | Killed application application_1749462877863_5866
[root@xxxxx][/usr/local/dolphinscheduler] # echo $?

### Version

3.3.0-alpha

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
