chinashenkai opened a new issue #7304: URL: https://github.com/apache/dolphinscheduler/issues/7304
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

My DolphinScheduler instance is deployed on cluster A. When it runs a Yarn task on cluster B via SSH, DolphinScheduler monitors the application_id, but that application_id cannot be found on cluster A (it lives on cluster B's Yarn). As a result, some tasks are reported as failed even though they actually completed successfully. The log is as follows:

```
[INFO] 2021-12-10 11:12:18.415 - [taskAppId=TASK-118-6871-11452]:[138] - -> SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cslc/apache-hive-2.0.0-bin/lib/hive-jdbc-2.0.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cslc/apache-hive-2.0.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cslc/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[INFO] 2021-12-10 11:12:25.417 - [taskAppId=TASK-118-6871-11452]:[138] - -> Logging initialized using configuration in file:/opt/cslc/apache-hive-2.0.0-bin/conf/hive-log4j2.properties
[INFO] 2021-12-10 11:12:39.419 - [taskAppId=TASK-118-6871-11452]:[138] - -> OK
Time taken: 3.464 seconds
[INFO] 2021-12-10 11:12:41.420 - [taskAppId=TASK-118-6871-11452]:[138] - -> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = dip_20211210111235_08db789d-05e7-43ab-aba9-f7984774e4d4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
[INFO] 2021-12-10 11:12:42.421 - [taskAppId=TASK-118-6871-11452]:[138] - -> Starting Job = job_1631871075019_193985, Tracking URL = http://pdip002:8188/proxy/application_1631871075019_193985/
Kill Command = /opt/cslc/hadoop-2.7.2/bin/hadoop job -kill job_1631871075019_193985
[INFO] 2021-12-10 11:12:52.423 - [taskAppId=TASK-118-6871-11452]:[138] - -> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-12-10 11:12:51,695 Stage-1 map = 0%, reduce = 0%
[INFO] 2021-12-10 11:12:58.424 - [taskAppId=TASK-118-6871-11452]:[138] - -> 2021-12-10 11:12:58,056 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.57 sec
[INFO] 2021-12-10 11:13:04.425 - [taskAppId=TASK-118-6871-11452]:[138] - -> 2021-12-10 11:13:04,406 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.01 sec
[INFO] 2021-12-10 11:13:06.426 - [taskAppId=TASK-118-6871-11452]:[138] - -> MapReduce Total cumulative CPU time: 8 seconds 10 msec
Ended Job = job_1631871075019_193985
[INFO] 2021-12-10 11:13:06.996 - [taskAppId=TASK-118-6871-11452]:[447] - find app id: application_1631871075019_193985
[INFO] 2021-12-10 11:13:06.996 - [taskAppId=TASK-118-6871-11452]:[404] - check yarn application status, appId:application_1631871075019_193985
[ERROR] 2021-12-10 11:13:07.014 - [taskAppId=TASK-118-6871-11452]:[420] - yarn applications: application_1631871075019_193985 , query status failed, exception:{}
java.lang.NullPointerException: null
	at org.apache.dolphinscheduler.common.utils.HadoopUtils.getApplicationStatus(HadoopUtils.java:423)
	at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.isSuccessOfYarnState(AbstractCommandExecutor.java:406)
	at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:230)
	at org.apache.dolphinscheduler.server.worker.task.shell.ShellTask.handle(ShellTask.java:101)
	at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:139)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[INFO] 2021-12-10 11:13:07.014 - [taskAppId=TASK-118-6871-11452]:[238] - process has exited, execute path:/cslc/dip001/dolphinscheduler_exec/exec/process/2/118/6871/11452, processId:40838 ,exitStatusCode:-1 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2021-12-10 11:13:07.427 - [taskAppId=TASK-118-6871-11452]:[138] - -> MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.01 sec   HDFS Read: 97430 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 10 msec
OK
507
Time taken: 27.786 seconds, Fetched: 1 row(s)
```

### What you expected to happen

Should DolphinScheduler skip checking the status of the Yarn application_id in this case?

### How to reproduce

Deploy DolphinScheduler on cluster A and run a Yarn task on cluster B via SSH (clusters A and B each run their own Yarn; the SSH command looks like `ssh user@host "command"`). DolphinScheduler monitors the application_id, but that application_id cannot be found on cluster A (it lives on cluster B's Yarn), so some tasks are reported as failed even though they actually completed successfully.

### Anything else

_No response_

### Version

1.3.9

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
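To make the suggestion above concrete, here is a minimal sketch of the direction I have in mind. The names (`isSuccessOfYarnState`, `AppStatus`, the `statusLookup` callback) are hypothetical stand-ins, not the actual DolphinScheduler 1.3.9 code: the idea is simply that an application_id the local Yarn cannot resolve should be skipped, so the task's final state comes from the process exit code rather than a failed status lookup.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Sketch only (NOT the real DolphinScheduler implementation): treat an
// application_id that the local Yarn cannot resolve as "unknown" instead of
// failing the task, so the shell process exit code decides the final state.
public class YarnStateCheckSketch {

    // Simplified stand-in for the real application status enum.
    enum AppStatus { SUCCESS, FAILED }

    // Hypothetical null-safe variant of the Yarn state check. statusLookup
    // models HadoopUtils.getApplicationStatus, which in the log above throws
    // an NPE because the appId belongs to cluster B's Yarn, not cluster A's.
    static boolean isSuccessOfYarnState(List<String> appIds,
                                        Function<String, AppStatus> statusLookup) {
        for (String appId : appIds) {
            AppStatus status;
            try {
                status = statusLookup.apply(appId);
            } catch (RuntimeException e) {
                continue; // unknown on this cluster: skip instead of failing
            }
            if (status != null && status != AppStatus.SUCCESS) {
                return false; // a genuinely failed local application still fails the task
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> remoteAppIds =
                Collections.singletonList("application_1631871075019_193985");
        // Simulate cluster A, where the appId from cluster B cannot be resolved.
        boolean success = isSuccessOfYarnState(remoteAppIds, id -> {
            throw new NullPointerException("appId not found on this cluster");
        });
        System.out.println("task success = " + success); // task success = true
    }
}
```

If this direction is acceptable, the real change would presumably go into `AbstractCommandExecutor.isSuccessOfYarnState` / `HadoopUtils.getApplicationStatus`, or alternatively be a task-level option to skip the Yarn status check entirely for SSH-to-remote-cluster tasks.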
