TheWindIsRising opened a new issue, #15120: URL: https://github.com/apache/dolphinscheduler/issues/15120
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

DolphinScheduler version: 3.1.2

Persistence is configured in values.yaml. Inside the worker pod, the spark-2.4.7-bin-hadoop2.7 and hadoop-2.7.0.tar environments are set up, and the hadoop/yarn/hdfs site.xml files were copied in from the external Hadoop cluster.

DolphinScheduler 3.1.2 is deployed with Helm, and tasks are submitted to the external Hadoop cluster for scheduling. After clicking "stop" on a task in the workflow instance page, the task keeps running on the external YARN, and the worker pod reports this error:

```
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 17 common frames omitted
[INFO] 2023-11-03 15:52:14.679 +0800 org.apache.dolphinscheduler.server.worker.processor.TaskKillProcessor:[238] - [WorkflowInstance-0][TaskInstance-111] - Get appIds from worker dolphinscheduler-worker-0.dolphinscheduler-worker-headless:1234 taskLogPath: /opt/dolphinscheduler/logs/20231103/10676729589025_23-107-111.log
[INFO] 2023-11-03 15:52:14.679 +0800 org.apache.dolphinscheduler.service.log.LogClient:[208] - [WorkflowInstance-0][TaskInstance-111] - Begin to get appIds from worker: dolphinscheduler-worker-0.dolphinscheduler-worker-headless:1234 taskLogPath: /opt/dolphinscheduler/logs/20231103/10676729589025_23-107-111.log
[INFO] 2023-11-03 15:52:14.680 +0800 org.apache.dolphinscheduler.plugin.task.api.utils.LogUtils:[66] - [WorkflowInstance-0][TaskInstance-111] - Find appId: application_1693365157704_0040 from /opt/dolphinscheduler/logs/20231103/10676729589025_23-107-111.log
[INFO] 2023-11-03 15:52:14.680 +0800 org.apache.dolphinscheduler.service.log.LogClient:[222] - [WorkflowInstance-0][TaskInstance-111] - Get appIds: [application_1693365157704_0040] from worker: dolphinscheduler-worker-0.dolphinscheduler-worker-headless:1234 taskLogPath: /opt/dolphinscheduler/logs/20231103/10676729589025_23-107-111.log
[INFO] 2023-11-03 15:52:14.686 +0800 org.apache.dolphinscheduler.service.utils.ProcessUtils:[96] - [WorkflowInstance-0][TaskInstance-111] - get kerberos init command
[INFO] 2023-11-03 15:52:14.687 +0800 org.apache.dolphinscheduler.server.worker.processor.TaskKillProcessor:[144] - [WorkflowInstance-0][TaskInstance-111] - kill cmd:sudo -u hdfs sh /tmp/dolphinscheduler/exec/process/hdfs/10667691377184/10676729589025_23/107/111/application_1693365157704_0040.kill
[ERROR] 2023-11-03 15:52:14.696 +0800 org.apache.dolphinscheduler.server.worker.processor.TaskKillProcessor:[147] - [WorkflowInstance-0][TaskInstance-111] - Kill yarn application app id [application_1693365157704_0040] failed: [/tmp/dolphinscheduler/exec/process/hdfs/10667691377184/10676729589025_23/107/111/application_1693365157704_0040.kill: 4: source: not found
/tmp/dolphinscheduler/exec/process/hdfs/10667691377184/10676729589025_23/107/111/application_1693365157704_0040.kill: 7: yarn: not found
```
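Two separate problems are visible in that log: the kill script is invoked with `sh`, whose POSIX implementation (dash on many base images, and the `file: line: cmd: not found` error format matches dash) has no `source` builtin, and the `yarn` binary is not on the `PATH` of the executing user (`sudo -u hdfs`). A minimal sketch reproducing the mechanism; the script body below is a stand-in, not the exact file DolphinScheduler generates:

```
# Hypothetical stand-in for the generated application_XXX.kill script;
# per the error above, the real one `source`s an env file and calls `yarn`.
cat > /tmp/demo.kill <<'EOF'
#!/bin/bash
source /etc/profile
yarn application -kill application_1693365157704_0040
EOF

# The worker runs the script as `sh <file>` (see the kill cmd above),
# which ignores the shebang. Where /bin/sh is dash, `source` fails:
sh /tmp/demo.kill    # -> /tmp/demo.kill: 2: source: not found

# Under bash the `source` line succeeds; `yarn: command not found`
# then remains until the yarn binary is on PATH for that user:
bash /tmp/demo.kill
```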
"/dolphinscheduler" is recommended resource.storage.upload.base.path: /dolphinscheduler # whether to startup kerberos hadoop.security.authentication.startup.state: false # java.security.krb5.conf path java.security.krb5.conf.path: /opt/krb5.conf # login user from keytab username login.user.keytab.username: [email protected] # login user from keytab path login.user.keytab.path: /opt/hdfs.headless.keytab # kerberos expire time, the unit is hour kerberos.expire.time: 2 # resource view suffixs #resource.view.suffixs: txt,log,sh,bat,conf,cfg,py,java,sql,xml,hql,properties,json,yml,yaml,ini,js # if resource.storage.type=HDFS, the user must have the permission to create directories under the HDFS root path resource.hdfs.root.user: hdfs # if resource.storage.type=S3, the value like: s3a://dolphinscheduler; if resource.storage.type=HDFS and namenode HA is enabled, you need to copy core-site.xml and hdfs-site.xml to conf dir resource.hdfs.fs.defaultFS: s3a://dolphinscheduler # The AWS access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required resource.aws.access.key.id: admin # The AWS secret access key. if resource.storage.type=S3 or use EMR-Task, This configuration is required resource.aws.secret.access.key: xxxxxxx # The AWS Region to use. if resource.storage.type=S3 or use EMR-Task, This configuration is required resource.aws.region: cn-north-1 # The name of the bucket. You need to create them by yourself. Otherwise, the system cannot start. All buckets in Amazon S3 share a single namespace; ensure the bucket is given a unique name. resource.aws.s3.bucket.name: dolphinscheduler # You need to set this parameter when private cloud s3. If S3 uses public cloud, you only need to set resource.aws.region or set to the endpoint of a public cloud such as S3.cn-north-1.amazonaws.com.cn resource.aws.s3.endpoint: http://10.200.x.xxx:9000 # resourcemanager port, the default value is 8088 if not specified resource.manager.httpaddress.port: 8088 # if resourcemanager HA is enabled, please set the HA IPs; if resourcemanager is single, keep this value empty yarn.resourcemanager.ha.rm.ids: 192.168.xx.xx,192.168.xx.xx # if resourcemanager HA is enabled or not use resourcemanager, please keep the default value; If resourcemanager is single, you only need to replace ds1 to actual resourcemanager hostname yarn.application.status.address: http://ndsc03.slave.com:%s/ws/v1/cluster/apps/%s # job history status url when application number threshold is reached(default 10000, maybe it was set to 1000) yarn.job.history.status.address: http://ndsc03.slave.com:19888/ws/v1/history/mapreduce/jobs/%s # datasource encryption enable datasource.encryption.enable: false # datasource encryption salt datasource.encryption.salt: '!@#$%^&*' # data quality option data-quality.jar.name: dolphinscheduler-data-quality-dev-SNAPSHOT.jar #data-quality.error.output.path: /tmp/data-quality-error-data # Network IP gets priority, default inner outer # Whether hive SQL is executed in the same session support.hive.oneSession: false # use sudo or not, if set true, executing user is tenant user and deploy user needs sudo permissions; if set false, executing user is the deploy user and doesn't need sudo permissions sudo.enable: true # network interface preferred like eth0, default: empty #dolphin.scheduler.network.interface.preferred: # network IP gets priority, default: inner outer #dolphin.scheduler.network.priority.strategy: default # system env path #dolphinscheduler.env.path: dolphinscheduler_env.sh # development state 
### What you expected to happen

I think this is a problem in the backend API.

### How to reproduce

Deploy DolphinScheduler with Helm, with persistence configured in values.yaml, the spark-2.4.7-bin-hadoop2.7 and hadoop-2.7.0.tar environments set up inside the worker pod, and the hadoop/yarn/hdfs site.xml files copied in from the external Hadoop cluster. Define a workflow with a Spark task in the UI. After the task has been submitted to the external YARN, stop it from the workflow instance page, check on the YARN UI whether it was really killed, then inspect the logs inside the worker pod (see also the verification sketch at the end of this report):

```
kubectl logs -f dolphinscheduler-worker-0 --tail 500 -n namespace
```

### Anything else

_No response_

### Version

3.1.x

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
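For completeness, one way to verify the diagnosis from inside the worker pod and to clean up the orphaned application by hand; the pod name, namespace placeholder, and application id are taken from the logs above, everything else is an assumption about the image layout:

```
# Shell into the worker pod (namespace is a placeholder).
kubectl exec -it dolphinscheduler-worker-0 -n namespace -- bash

# If the generated kill script is still on disk, inspect how it is written:
cat /tmp/dolphinscheduler/exec/process/hdfs/10667691377184/10676729589025_23/107/111/application_1693365157704_0040.kill

# Check whether `yarn` resolves at all for the executing user:
sudo -u hdfs which yarn || echo "yarn not on PATH for hdfs"

# Manual cleanup of the application that kept running on the external YARN,
# assuming yarn is reachable once PATH is fixed:
yarn application -kill application_1693365157704_0040
```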
