chinashenkai opened a new issue #7304: URL: https://github.com/apache/dolphinscheduler/issues/7304
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

My DolphinScheduler instance is deployed on cluster A. When it runs a Yarn task on cluster B via SSH, DolphinScheduler monitors the application_id, but that application_id cannot be found on cluster A (it lives on cluster B's Yarn). As a result, some tasks are reported as failed even though they actually completed successfully. The log is as follows:

```
[INFO] 2021-12-10 11:12:18.415 - [taskAppId=TASK-118-6871-11452]:[138] - -> SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cslc/apache-hive-2.0.0-bin/lib/hive-jdbc-2.0.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cslc/apache-hive-2.0.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cslc/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[INFO] 2021-12-10 11:12:25.417 - [taskAppId=TASK-118-6871-11452]:[138] - -> Logging initialized using configuration in file:/opt/cslc/apache-hive-2.0.0-bin/conf/hive-log4j2.properties
[INFO] 2021-12-10 11:12:39.419 - [taskAppId=TASK-118-6871-11452]:[138] - -> OK
Time taken: 3.464 seconds
[INFO] 2021-12-10 11:12:41.420 - [taskAppId=TASK-118-6871-11452]:[138] - -> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = dip_20211210111235_08db789d-05e7-43ab-aba9-f7984774e4d4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
[INFO] 2021-12-10 11:12:42.421 - [taskAppId=TASK-118-6871-11452]:[138] - -> Starting Job = job_1631871075019_193985, Tracking URL = http://pdip002:8188/proxy/application_1631871075019_193985/
Kill Command = /opt/cslc/hadoop-2.7.2/bin/hadoop job -kill job_1631871075019_193985
[INFO] 2021-12-10 11:12:52.423 - [taskAppId=TASK-118-6871-11452]:[138] - -> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-12-10 11:12:51,695 Stage-1 map = 0%, reduce = 0%
[INFO] 2021-12-10 11:12:58.424 - [taskAppId=TASK-118-6871-11452]:[138] - -> 2021-12-10 11:12:58,056 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.57 sec
[INFO] 2021-12-10 11:13:04.425 - [taskAppId=TASK-118-6871-11452]:[138] - -> 2021-12-10 11:13:04,406 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.01 sec
[INFO] 2021-12-10 11:13:06.426 - [taskAppId=TASK-118-6871-11452]:[138] - -> MapReduce Total cumulative CPU time: 8 seconds 10 msec
Ended Job = job_1631871075019_193985
[INFO] 2021-12-10 11:13:06.996 - [taskAppId=TASK-118-6871-11452]:[447] - find app id: application_1631871075019_193985
[INFO] 2021-12-10 11:13:06.996 - [taskAppId=TASK-118-6871-11452]:[404] - check yarn application status, appId:application_1631871075019_193985
[ERROR] 2021-12-10 11:13:07.014 - [taskAppId=TASK-118-6871-11452]:[420] - yarn applications: application_1631871075019_193985 , query status failed, exception:{}
java.lang.NullPointerException: null
	at org.apache.dolphinscheduler.common.utils.HadoopUtils.getApplicationStatus(HadoopUtils.java:423)
	at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.isSuccessOfYarnState(AbstractCommandExecutor.java:406)
	at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:230)
	at org.apache.dolphinscheduler.server.worker.task.shell.ShellTask.handle(ShellTask.java:101)
	at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:139)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[INFO] 2021-12-10 11:13:07.014 - [taskAppId=TASK-118-6871-11452]:[238] - process has exited, execute path:/cslc/dip001/dolphinscheduler_exec/exec/process/2/118/6871/11452, processId:40838 ,exitStatusCode:-1 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2021-12-10 11:13:07.427 - [taskAppId=TASK-118-6871-11452]:[138] - -> MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.01 sec   HDFS Read: 97430 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 10 msec
OK
507
Time taken: 27.786 seconds, Fetched: 1 row(s)
```

### What you expected to happen

Should DolphinScheduler skip checking the status of the Yarn application_id in this case?

### How to reproduce

Deploy DolphinScheduler on cluster A and run a Yarn task on cluster B via SSH (clusters A and B each run their own Yarn; the SSH command looks like `ssh user@host "command"`). DolphinScheduler monitors the application_id, but that application_id cannot be found on cluster A (it lives on cluster B's Yarn), so some tasks are reported as failed even though they actually completed successfully.

### Anything else

_No response_

### Version

1.3.9

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
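To make the suggestion above concrete, here is a minimal sketch of the direction I have in mind. The names (`isSuccessOfYarnState`, `AppStatus`, the `statusLookup` callback) are hypothetical stand-ins, not the actual DolphinScheduler 1.3.9 code: the idea is simply that an application_id the local Yarn cannot resolve should be skipped, so the task's final state comes from the process exit code rather than a failed status lookup.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Sketch only (NOT the real DolphinScheduler implementation): treat an
// application_id that the local Yarn cannot resolve as "unknown" instead of
// failing the task, so the shell process exit code decides the final state.
public class YarnStateCheckSketch {

    // Simplified stand-in for the real application status enum.
    enum AppStatus { SUCCESS, FAILED }

    // Hypothetical null-safe variant of the Yarn state check. statusLookup
    // models HadoopUtils.getApplicationStatus, which in the log above throws
    // an NPE because the appId belongs to cluster B's Yarn, not cluster A's.
    static boolean isSuccessOfYarnState(List<String> appIds,
                                        Function<String, AppStatus> statusLookup) {
        for (String appId : appIds) {
            AppStatus status;
            try {
                status = statusLookup.apply(appId);
            } catch (RuntimeException e) {
                continue; // unknown on this cluster: skip instead of failing
            }
            if (status != null && status != AppStatus.SUCCESS) {
                return false; // a genuinely failed local application still fails the task
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> remoteAppIds =
                Collections.singletonList("application_1631871075019_193985");
        // Simulate cluster A, where the appId from cluster B cannot be resolved.
        boolean success = isSuccessOfYarnState(remoteAppIds, id -> {
            throw new NullPointerException("appId not found on this cluster");
        });
        System.out.println("task success = " + success); // task success = true
    }
}
```

If this direction is acceptable, the real change would presumably go into `AbstractCommandExecutor.isSuccessOfYarnState` / `HadoopUtils.getApplicationStatus`, or alternatively be a task-level option to skip the Yarn status check entirely for SSH-to-remote-cluster tasks.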
