[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #7304: [Bug] [server] SSH to other clusters to execute Yarn tasks

GitBox Thu, 09 Dec 2021 19:23:31 -0800


github-actions[bot] commented on issue #7304:
URL: 
https://github.com/apache/dolphinscheduler/issues/7304#issuecomment-990576082



   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When my Dolphinscheduler, deployed on cluster A, executes Yarn tasks on 
cluster B via SSH, dolphin monitors the application_id, but this application_id 
cannot be found on cluster A (because it belongs to cluster B's Yarn). As a 
result, some tasks are shown as failed even though they actually complete 
normally.
   
   The log is as follows
   ```
   [INFO] 2021-12-10 11:12:18.415-[taskAppId=TASK-118-6871-11452]:[138]--> 
SLF4J: Class path contains multiple SLF4J bindings.
   SLF4J: Found binding in 
[jar:file:/opt/cslc/apache-hive-2.0.0-bin/lib/hive-jdbc-2.0.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: Found binding in 
[jar:file:/opt/cslc/apache-hive-2.0.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: Found binding in 
[jar:file:/opt/cslc/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
explanation.
   SLF4J: Actual binding is of type 
[org.apache.logging.slf4j.Log4jLoggerFactory]
   [INFO] 2021-12-10 11:12:25.417-[taskAppId=TASK-118-6871-11452]:[138]-->
   Logging initialized using configuration in 
file:/opt/cslc/apache-hive-2.0.0-bin/conf/hive-log4j2.properties
   [INFO] 2021-12-10 11:12:39.419-[taskAppId=TASK-118-6871-11452]:[138]--> OK
   Time taken: 3.464 seconds
   [INFO] 2021-12-10 11:12:41.420-[taskAppId=TASK-118-6871-11452]:[138]--> 
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
future versions. Consider using a different execution engine (ie spark, tez) or 
using Hive 1.X releases.
   Query ID = dip_20211210111235_08db789d-05e7-43ab-aba9-f7984774e4d4
   Total jobs = 1
   Launching Job 1 out of 1
   Number of reduce tasks determined at compile time: 1
   In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
   In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
   In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
   [INFO] 2021-12-10 11:12:42.421-[taskAppId=TASK-118-6871-11452]:[138]--> 
Starting Job = job_1631871075019_193985, Tracking URL = 
http://pdip002:8188/proxy/application_1631871075019_193985/
   Kill Command = /opt/cslc/hadoop-2.7.2/bin/hadoop job -kill 
job_1631871075019_193985
   [INFO] 2021-12-10 11:12:52.423-[taskAppId=TASK-118-6871-11452]:[138]--> 
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
   2021-12-10 11:12:51,695 Stage-1 map = 0%, reduce = 0%
   [INFO] 2021-12-10 11:12:58.424-[taskAppId=TASK-118-6871-11452]:[138]--> 
2021-12-10 11:12:58,056 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.57 sec
   [INFO] 2021-12-10 11:13:04.425-[taskAppId=TASK-118-6871-11452]:[138]--> 
2021-12-10 11:13:04,406 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.01 
sec
   [INFO] 2021-12-10 11:13:06.426-[taskAppId=TASK-118-6871-11452]:[138]--> 
MapReduce Total cumulative CPU time: 8 seconds 10 msec
   Ended Job = job_1631871075019_193985
   [INFO] 2021-12-10 11:13:06.996-[taskAppId=TASK-118-6871-11452]:[447]-find 
app id: application_1631871075019_193985
   [INFO] 2021-12-10 11:13:06.996-[taskAppId=TASK-118-6871-11452]:[404]-check 
yarn application status, appId:application_1631871075019_193985
   [ERROR] 2021-12-10 11:13:07.014-[taskAppId=TASK-118-6871-11452]:[420]-yarn 
applications: application_1631871075019_193985, query status failed, 
exception:{}
   java.lang.NullPointerException: null
   at 
org.apache.dolphinscheduler.common.utils.HadoopUtils.getApplicationStatus(HadoopUtils.java:423)
   at 
org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.isSuccessOfYarnState(AbstractCommandExecutor.java:406)
   at 
org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:230)
   at 
org.apache.dolphinscheduler.server.worker.task.shell.ShellTask.handle(ShellTask.java:101)
   at 
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:139)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
   [INFO] 2021-12-10 11:13:07.014-[taskAppId=TASK-118-6871-11452]:[238]-process 
has exited, execute 
path:/cslc/dip001/dolphinscheduler_exec/exec/process/2/118/6871/11452, 
processId:40838 ,exitStatusCode:-1 ,processWaitForStatus:true 
,processExitValue:0
   [INFO] 2021-12-10 11:13:07.427-[taskAppId=TASK-118-6871-11452]:[138]--> 
MapReduce Jobs Launched:
   Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.01 sec HDFS Read: 97430 
HDFS Write: 4 SUCCESS
   Total MapReduce CPU Time Spent: 8 seconds 10 msec
   OK
   507
   Time taken: 27.786 seconds, Fetched: 1 row(s)
   ```
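
To illustrate why the worker ends up querying an application that lives on another cluster, here is a minimal sketch of how an application ID can be picked out of task output like the log above. The class name `YarnAppIdScanner` and the pattern `application_\d+_\d+` are assumptions made for illustration, not the DolphinScheduler source. The scan only sees text, so it cannot tell which cluster's Yarn produced the ID.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not DolphinScheduler code. */
final class YarnAppIdScanner {
    // Assumed shape of a Yarn application ID: application_<clusterTimestamp>_<sequence>.
    private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");

    /** Returns every distinct application ID found in the text, in order of first appearance. */
    static Set<String> findAppIds(String logText) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = APP_ID.matcher(logText);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return ids;
    }
}
```

Fed the "Starting Job" line from the log, this would pick up the ID from the tracking URL even though the job was submitted to cluster B over SSH.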
   
   ### What you expected to happen
   
   Shouldn't the status of Yarn task application_id be checked?
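
One possible reading of this expectation, sketched under assumptions: when the Yarn state cannot be resolved on the local cluster (for example because the application ran on cluster B), the task result falls back to the process exit value instead of being marked as failed. `TaskOutcome`, `YarnState` and `isSuccess` are hypothetical names, not the project's API.

```java
import java.util.Optional;

/** Hypothetical decision helper for illustration only; not DolphinScheduler code. */
final class TaskOutcome {
    /** Simplified, assumed set of final states for a Yarn application. */
    enum YarnState { FINISHED_SUCCEEDED, FAILED, KILLED }

    /**
     * @param processExitValue exit value of the shell process (0 means success)
     * @param yarnState        state reported by the local Yarn, empty if the application is unknown there
     */
    static boolean isSuccess(int processExitValue, Optional<YarnState> yarnState) {
        if (processExitValue != 0) {
            return false; // the command itself failed
        }
        // Unknown application: trust the exit value rather than reporting an error.
        return yarnState.map(state -> state == YarnState.FINISHED_SUCCEEDED).orElse(true);
    }
}
```

With the values from the log (processExitValue 0, application not found on cluster A), this sketch would report success.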
   
   ### How to reproduce
   
   Deploy Dolphinscheduler on cluster A and execute Yarn tasks on cluster B via 
SSH (clusters A and B each run their own Yarn; an example ssh command: ssh 
user@host "command"). Dolphin will monitor the application_id, but this 
application_id is not found on cluster A (because it is on cluster B's Yarn), 
so some tasks are displayed as failed although they actually complete 
normally.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   1.3.9
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
