quanzhian opened a new issue #5461:
URL: https://github.com/apache/dolphinscheduler/issues/5461
In the class org.apache.dolphinscheduler.server.utils.ProcessUtils (ProcessUtils.java):
public static void kill(TaskExecutionContext taskExecutionContext) {
    try {
        int processId = taskExecutionContext.getProcessId();
        if (processId == 0) {
            logger.error("process kill failed, process id :{}, task id:{}",
                    processId, taskExecutionContext.getTaskInstanceId());
            return;
        }
        // getPidsStr(processId) sometimes fails to return any PID here, so the
        // "kill -9" command is run without arguments and fails. The maintainers
        // need to add an empty-value check.
        String cmd = String.format("sudo kill -9 %s", getPidsStr(processId));
        logger.info("process id:{}, cmd:{}", processId, cmd);
        OSUtils.exeCmd(cmd);
    } catch (Exception e) {
        logger.error("kill task failed", e);
    }
    // find log and kill yarn job
    killYarnJob(taskExecutionContext);
}
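The empty-value check requested in the comment above could look roughly like the sketch below. It is only an illustration, not the project's actual fix: the class `KillCommandSketch` and the method `buildKillCommand` are hypothetical names, and the `pidsStr` parameter stands in for whatever `getPidsStr(processId)` returns. The idea is to build the `kill` command only when at least one PID was found, so that `sudo kill -9` is never run without arguments.

```java
import java.util.Optional;

public class KillCommandSketch {

    /**
     * Builds the kill command for the given PID string.
     * pidsStr is assumed to be the space-separated PID list that
     * getPidsStr(processId) returns; it may be null or blank when the
     * process has already exited.
     */
    public static Optional<String> buildKillCommand(String pidsStr) {
        if (pidsStr == null || pidsStr.trim().isEmpty()) {
            // Nothing to kill: the caller should skip OSUtils.exeCmd entirely.
            return Optional.empty();
        }
        return Optional.of(String.format("sudo kill -9 %s", pidsStr.trim()));
    }

    public static void main(String[] args) {
        // prints "Optional[sudo kill -9 11319 11320]"
        System.out.println(buildKillCommand("11319 11320"));
        // prints "Optional.empty"
        System.out.println(buildKillCommand("  "));
    }
}
```

In `kill(...)`, the caller would then execute the command only when the returned `Optional` is present, and otherwise log that no running process was found before moving on to `killYarnJob`.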
The exception log is as follows:
[INFO] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[347] - task run command:
sudo -u dolphinscheduler sh /tmp/dolphinscheduler/exec/process/2/107/71/109/107_71_109.command
[INFO] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[228] - process start, process id is: 11319
[INFO] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[237] - process has exited, execute path:/tmp/dolphinscheduler/exec/process/2/107/71/109, processId:11319 ,exitStatusCode:0
[ERROR] 2021-05-12 18:28:32.942 - [taskAppId=TASK-107-71-109]:[256] - process has failure , exitStatusCode : 0 , ready to kill ...
[INFO] 2021-05-12 18:28:32.969 org.apache.dolphinscheduler.server.utils.ProcessUtils:[373] - process id:11319, cmd:sudo kill -9
[ERROR] 2021-05-12 18:28:32.981 org.apache.dolphinscheduler.server.utils.ProcessUtils:[378] - kill task failed
org.apache.dolphinscheduler.common.shell.AbstractShell$ExitCodeException:
Usage:
 kill [options] <pid|name> [...]

Options:
 -a, --all              do not restrict the name-to-pid conversion to processes
                        with the same uid as the present process
 -s, --signal <sig>     send specified signal
 -q, --queue <sig>      use sigqueue(2) rather than kill(2)
 -p, --pid              print pids without signaling them
 -l, --list [=<signal>] list signal names, or convert one to a name
 -L, --table            list signal names and numbers

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
    at org.apache.dolphinscheduler.common.shell.AbstractShell.runCommand(AbstractShell.java:209)
    at org.apache.dolphinscheduler.common.shell.AbstractShell.run(AbstractShell.java:124)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execute(ShellExecutor.java:127)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:104)
    at org.apache.dolphinscheduler.common.shell.ShellExecutor.execCommand(ShellExecutor.java:87)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeShell(OSUtils.java:394)
    at org.apache.dolphinscheduler.common.utils.OSUtils.exeCmd(OSUtils.java:384)
    at org.apache.dolphinscheduler.server.utils.ProcessUtils.kill(ProcessUtils.java:375)
    at org.apache.dolphinscheduler.server.worker.task.AbstractCommandExecutor.run(AbstractCommandExecutor.java:257)
    at org.apache.dolphinscheduler.server.worker.task.qtdataIntegration.QtDiTask.handle(QtDiTask.java:166)
    at org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread.run(TaskExecuteThread.java:134)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[INFO] 2021-05-12 18:28:33.943 - [taskAppId=TASK-107-71-109]:[129] - -> flinkx starting ...
18:28:33.378 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
18:28:33.381 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
[INFO] 2021-05-12 18:28:33.982 org.apache.dolphinscheduler.service.log.LogClientService:[100] - view log path /mnt/services/dolphinscheduler136/logs/107/71/109.log
[INFO] 2021-05-12 18:28:33.988 org.apache.dolphinscheduler.remote.NettyRemotingClient:[403] - netty client closed
[INFO] 2021-05-12 18:28:33.988 org.apache.dolphinscheduler.service.log.LogClientService:[59] - logger client closed
[INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[142] - task instance id : 109,task final status : FAILURE
[INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[162] - develop mode is: false
[INFO] 2021-05-12 18:28:33.989 org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[180] - exec local path: /tmp/dolphinscheduler/exec/process/2/107/71/109 cleared.
[INFO] 2021-05-12 18:28:34.944 - [taskAppId=TASK-107-71-109]:[129] - ->
18:28:34.159 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobSubmitter - start to submit per-job task, LauncherOptions = Options{mode='yarnPer', job='/tmp/dolphinscheduler/exec/process/2/107/71/109/107_71_109_job.json', monitor='null', jobid='Flink Job', flinkconf='/mnt/services/flink-1.8.8/conf', pluginRoot='/mnt/services/flinkx/plugins', remotePluginPath='null', yarnconf='/etc/hadoop/conf', parallelism='1', priority='1', queue='default', flinkLibJar='/mnt/services/flink-1.8.8/lib', confProp='{"flink.checkpoint.interval":60000}', p='', s='null', pluginLoadMode='shipfile', appId='null'}
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
18:28:34.167 [main] INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
18:28:34.305 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:28:34.360 [main] INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to dolphinscheduler (auth:SIMPLE)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.yarn.ipc.YarnRPC).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
18:28:34.543 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobClusterClientBuilder - ----init yarn success ----
18:28:34.666 [main] INFO org.apache.hadoop.conf.Configuration - resource-types.xml not found
18:28:34.666 [main] INFO org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
18:28:34.704 [main] WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The JobManager or TaskManager memory is below the smallest possible YARN Container size. The value of 'yarn.scheduler.minimum-allocation-mb' is '1024'. Please increase the memory size.YARN will allocate the smaller containers but the scheduler will account for the minimum-allocation-mb, maybe not all instances you requested will start.
18:28:34.704 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=1024, numberTaskManagers=1, slotsPerTaskManager=1}
[INFO] 2021-05-12 18:28:35.945 - [taskAppId=TASK-107-71-109]:[129] - ->
18:28:35.024 [main] WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18:28:35.033 [main] WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/mnt/services/flink-1.8.8/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
[INFO] 2021-05-12 18:28:36.946 - [taskAppId=TASK-107-71-109]:[129] - ->
18:28:36.716 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1609329939009_5348
18:28:36.741 [main] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1609329939009_5348
18:28:36.742 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
18:28:36.744 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
[INFO] 2021-05-12 18:28:40.946 - [taskAppId=TASK-107-71-109]:[129] - ->
18:28:40.025 [main] INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
18:28:40.320 [main] INFO org.apache.flink.runtime.rest.RestClient - Rest client endpoint started.
18:28:40.323 [main] INFO com.dtstack.flinkx.util.YarnUtil - HADOOP_CONF_DIR:/etc/hadoop/conf
18:28:40.372 [main] INFO com.dtstack.flinkx.util.YarnUtil - get 1080 config from /etc/hadoop/conf/core-site.xml
18:28:40.380 [main] INFO com.dtstack.flinkx.util.YarnUtil - get 23 config from /etc/hadoop/conf/hdfs-site.xml
18:28:40.400 [main] INFO com.dtstack.flinkx.util.YarnUtil - hdfs path:hdfs:///apps/flinkx/2021-05-12/816d1ef47c5a5cbd5557580126b17f22
18:28:40.401 [main] INFO com.dtstack.flinkx.util.YarnUtil - monitorUrl:bigdata-master01:8088/proxy/application_1609329939009_5348
18:28:40.421 [main] INFO com.dtstack.flinkx.launcher.perjob.PerJobSubmitter - deploy per_job with appId: application_1609329939009_5348}, jobId: 816d1ef47c5a5cbd5557580126b17f22
[INFO] 2021-05-12 18:28:40.947 - [taskAppId=TASK-107-71-109]:[127] - FINALIZE_SESSION