danielfree commented on issue #11227:
URL:
https://github.com/apache/dolphinscheduler/issues/11227#issuecomment-1248837649
> I think this has nothing to do with the master failover. Can you provide
the logs of the instance id 153 of the worker
Here's another case with my running installation:
instance id: 211
task instance id: 672
logs from worker:
```
[INFO] 2022-09-16 07:50:01.165
org.apache.dolphinscheduler.server.worker.processor.TaskExecuteProcessor:[110]
- received command :
TaskExecuteRequestCommand{taskExecutionContext='{"taskInstanceId":672,"taskName":"flightradar_top_routes","firstSubmitTime":"2022-09-16
07:50:01","startTime":null,"endTime":null,"taskType":"SHELL","host":null,"executePath":null,"logPath":null,"taskJson":null,"processId":0,"processDefineCode":6629036562401,"processDefineVersion":3,"appIds":null,"processInstanceId":211,"scheduleTime":"2022-09-16
07:50:00","globalParams":null,"executorId":2,"cmdTypeIfComplement":6,"tenantCode":"aspex","queue":"default","projectCode":6565237764192,"taskParams":"***","envFile":null,"environmentConfig":null,"definedParams":null,"taskAppId":null,"taskTimeoutStrategy":null,"taskTimeout":2147483647,"workerGroup":"default","delayTime":0,"currentExecutionStatus":null,"sqlTaskExecutionContext":{"warningGroupId":0,"connectionParams":null,"udfFuncTenantCodeMap":null,"defaultFS":null
},"dataxTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"dependenceTaskExecutionContext":null,"sqoopTaskExecutionContext":{"dataSourceId":0,"sourcetype":0,"sourceConnectionParams":null,"dataTargetId":0,"targetType":0,"targetConnectionParams":null},"varPool":null,"dryRun":0,"paramsMap":null,"procedureTaskExecutionContext":{"connectionParams":null}}'}
[INFO] 2022-09-16 07:50:01.167
org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success
/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.167
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[128] - task
instance local execute path :
/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.167
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the
task begins to execute. task instance id: 672
[INFO] 2022-09-16 07:50:01.265 TaskLogLogger-class
org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[60] - FINALIZE_SESSION
...
[INFO] 2022-09-16 07:50:01.287 TaskLogLogger-class
org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[138] - task execute
path : /tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:50:01.291 TaskLogLogger-class
org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[178] - process start,
process id is: 24435
07:50:42.439 [pool-2-thread-4] INFO
org.apache.dolphinscheduler.server.log.LoggerRequestProcessor - received
command : Command [type=VIEW_WHOLE_LOG_REQUEST, opaque=695701,
genCommandTimeMillis=0, bodyLen=82]
[INFO] 2022-09-16 07:53:58.443 TaskLogLogger-class
org.apache.dolphinscheduler.plugin.task.shell.ShellTask:[202] - process has
exited, execute
path:/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672,
processId:24435 ,exitStatusCode:0 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2022-09-16 07:53:58.444
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[206] - task
instance id : 672,task final status : SUCCESS
[INFO] 2022-09-16 07:53:58.444
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[264] -
develop mode is: false
[INFO] 2022-09-16 07:53:58.445
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[282] - exec
local path:
/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
cleared.
[INFO] 2022-09-16 07:53:59.196
org.apache.dolphinscheduler.common.utils.FileUtils:[121] - create dir success
/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:53:59.196
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[128] - task
instance local execute path :
/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672
[INFO] 2022-09-16 07:53:59.196
org.apache.dolphinscheduler.server.worker.runner.TaskExecuteThread:[146] - the
task begins to execute. task instance id: 672
```
Note that at **[INFO] 2022-09-16 07:53:59.196** the task instance 672 is
executed again after a SUCCESS run
master logs:
```
[INFO] 2022-09-16 07:50:00.012
org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[255] -
find command 212, slot:0 :
[INFO] 2022-09-16 07:50:00.013
org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[208] -
find one command: id: 212, type: SCHEDULER
[INFO] 2022-09-16 07:50:00.016
org.apache.dolphinscheduler.server.master.runner.MasterSchedulerService:[221] -
handle command end, command 212 process 211 start...
[INFO] 2022-09-16 07:50:00.025
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] -
add task to stand by list, task name:ssh-check, task id:0, task
code:6629036562272
[INFO] 2022-09-16 07:50:00.027
org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start
submit task : ssh-check, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:00.031
org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit
task to db successfully:671 ssh-check state:SUBMITTED_SUCCESS complete,
instance id:211 state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:00.033
org.apache.dolphinscheduler.service.process.ProcessService:[1584] - updated
resource info
{"id":2,"res":"ec2_aspexeng.pem","resourceName":"/pem/ec2_aspexeng.pem"}
[INFO] 2022-09-16 07:50:00.037
org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164]
- master submit success, task id:671, task name:ssh-check, process id:211
[INFO] 2022-09-16 07:50:00.039
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1374] -
remove task from stand by list, id: 671 name:ssh-check
[INFO] 2022-09-16 07:50:00.042
org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] -
taskAckCommand : TaskExecuteAckCommand{taskInstanceId=671, startTime=Fri Sep 16
07:50:00 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234',
status=1,
logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/671.log',
executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/671',
processInstanceId='0'}
[INFO] 2022-09-16 07:50:00.042
org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] -
taskAckCommand : TaskExecuteAckCommand{taskInstanceId=671, startTime=Fri Sep 16
07:50:00 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234',
status=1,
logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/671.log',
executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/671',
processInstanceId='211'}
[INFO] 2022-09-16 07:50:00.964
org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72]
- received command : TaskExecuteResponseCommand{taskInstanceId=671,
processInstanceId=211, status=7, startTime=null, endTime=Fri Sep 16 07:50:00
CST 2022, host=null, logPath=null, executePath=null, processId=24426,
appIds='', varPool=[]}
[INFO] 2022-09-16 07:50:01.022
org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223]
- already exists handler process size:0
[INFO] 2022-09-16 07:50:01.026
org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230]
- persist events 211 succeeded.
[INFO] 2022-09-16 07:50:01.144
org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] -
handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:50:01.145
org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] -
already exists handler process size:0
[INFO] 2022-09-16 07:50:01.145
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] -
process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus:
SUCCESS task instance id: 671 process instance id: 211 context: null
[INFO] 2022-09-16 07:50:01.146
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] -
work flow 211 task 671 state:SUCCESS
[INFO] 2022-09-16 07:50:01.150
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] -
add task to stand by list, task name:flightradar_top_routes, task id:0, task
code:6629036562400
[INFO] 2022-09-16 07:50:01.150
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1033] -
taskCode: 6629036562400 completeDependTaskList: [6629036562272]
[INFO] 2022-09-16 07:50:01.151
org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start
submit task : flightradar_top_routes, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:01.153
org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit
task to db successfully:672 flightradar_top_routes state:SUBMITTED_SUCCESS
complete, instance id:211 state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:50:01.156
org.apache.dolphinscheduler.service.process.ProcessService:[1584] - updated
resource info
{"id":2,"res":"ec2_aspexeng.pem","resourceName":"/pem/ec2_aspexeng.pem"}
[INFO] 2022-09-16 07:50:01.159
org.apache.dolphinscheduler.server.master.runner.task.CommonTaskProcessor:[164]
- master submit success, task id:672, task name:flightradar_top_routes, process
id:211
[INFO] 2022-09-16 07:50:01.161
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1374] -
remove task from stand by list, id: 672 name:flightradar_top_routes
[INFO] 2022-09-16 07:50:01.168
org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] -
taskAckCommand : TaskExecuteAckCommand{taskInstanceId=672, startTime=Fri Sep 16
07:50:01 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234',
status=1,
logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log',
executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672',
processInstanceId='0'}
[INFO] 2022-09-16 07:50:01.168
org.apache.dolphinscheduler.server.master.processor.TaskAckProcessor:[71] -
taskAckCommand : TaskExecuteAckCommand{taskInstanceId=672, startTime=Fri Sep 16
07:50:01 CST 2022, host='dolphin-worker-1.dolphin-worker-headless:1234',
status=1,
logPath='/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log',
executePath='/tmp/dolphinscheduler/exec/process/6565237764192/6629036562401_3/211/672',
processInstanceId='211'}
[INFO] 2022-09-16 07:50:02.022
org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223]
- already exists handler process size:0
[INFO] 2022-09-16 07:50:02.025
org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230]
- persist events 211 succeeded.
[INFO] 2022-09-16 07:50:41.424
org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] -
failover execute started
[INFO] 2022-09-16 07:50:41.426
org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] -
need failover hosts:[dolphin-master-1.dolphin-master-headless:5678]
[INFO] 2022-09-16 07:50:41.429
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] -
start master[dolphin-master-1.dolphin-master-headless:5678] failover, process
list size:2
[INFO] 2022-09-16 07:50:41.431
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[442] -
failover task instance id: 672, process instance id: 211
[INFO] 2022-09-16 07:50:41.424
org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[68] -
failover execute started
[INFO] 2022-09-16 07:50:41.426
org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread:[74] -
need failover hosts:[dolphin-master-1.dolphin-master-headless:5678]
[INFO] 2022-09-16 07:50:41.429
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[424] -
start master[dolphin-master-1.dolphin-master-headless:5678] failover, process
list size:2
**[INFO] 2022-09-16 07:50:41.431**
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[442] -
failover task instance id: 672, process instance id: 211
[INFO] 2022-09-16 07:50:42.432
org.apache.dolphinscheduler.service.log.LogClientService:[117] - view log path
/opt/apache-dolphinscheduler-2.0.6-bin/logs/6629036562401_3/211/672.log
[INFO] 2022-09-16 07:50:42.441
org.apache.dolphinscheduler.remote.NettyRemotingClient:[390] - netty client
closed
[INFO] 2022-09-16 07:50:42.445
org.apache.dolphinscheduler.service.log.LogClientService:[74] - logger client
closed
[ERROR] 2022-09-16 07:50:42.445
org.apache.dolphinscheduler.server.utils.ProcessUtils:[211] - kill yarn job
failure
java.lang.NullPointerException: null
at
org.apache.dolphinscheduler.server.utils.ProcessUtils.killYarnJob(ProcessUtils.java:197)
at
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverTaskInstance(MasterRegistryClient.java:490)
at
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient.failoverMaster(MasterRegistryClient.java:443)
at
org.apache.dolphinscheduler.server.master.runner.FailoverExecuteThread.run(FailoverExecuteThread.java:80)
[INFO] 2022-09-16 07:50:42.447
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[456] -
master[dolphin-master-1.dolphin-master-headless:5678] failover end,
useTime:1019ms
[INFO] 2022-09-16 07:50:43.147
org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] -
handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:50:43.148
org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] -
already exists handler process size:0
[INFO] 2022-09-16 07:50:43.148
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] -
process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus:
NEED_FAULT_TOLERANCE task instance id: 672 process instance id: 211 context:
null
[INFO] 2022-09-16 07:50:43.149
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] -
work flow 211 task 672 state:NEED_FAULT_TOLERANCE
[INFO] 2022-09-16 07:50:43.150
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1360] -
add task to stand by list, task name:flightradar_top_routes, task id:672, task
code:6629036562400
**[INFO] 2022-09-16 07:53:58.445**
org.apache.dolphinscheduler.server.master.processor.TaskResponseProcessor:[72]
- received command : TaskExecuteResponseCommand{taskInstanceId=672,
processInstanceId=211, status=7, startTime=null, endTime=Fri Sep 16 07:53:58
CST 2022, host=null, logPath=null, executePath=null, processId=24435,
appIds='', varPool=[]}
**[INFO] 2022-09-16 07:53:59.041**
org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[223]
- already exists handler process size:0
[INFO] 2022-09-16 07:53:59.044
org.apache.dolphinscheduler.server.master.processor.queue.TaskResponseService:[230]
- persist events 211 succeeded.
[INFO] 2022-09-16 07:53:59.168
org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[127] -
handle process instance : 211 , events count:1
[INFO] 2022-09-16 07:53:59.168
org.apache.dolphinscheduler.server.master.runner.EventExecuteService:[130] -
already exists handler process size:0
**[INFO] 2022-09-16 07:53:59.168**
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[308] -
process event: State Event :key: null type: TASK_STATE_CHANGE executeStatus:
SUCCESS task instance id: 672 process instance id: 211 context: null
[INFO] 2022-09-16 07:53:59.169
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[422] -
work flow 211 task 672 state:SUCCESS
[INFO] 2022-09-16 07:53:59.172
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThread:[1033] -
taskCode: 6629036562400 completeDependTaskList: [6629036562272, 6629036562400]
[INFO] 2022-09-16 07:53:59.173
org.apache.dolphinscheduler.service.process.ProcessService:[1088] - start
submit task : flightradar_top_routes, instance id:211, state: RUNNING_EXECUTION
[INFO] 2022-09-16 07:53:59.175
org.apache.dolphinscheduler.service.process.ProcessService:[1101] - end submit
task to db successfully:672 flightradar_top_routes state:SUBMITTED_SUCCESS
complete, instance id:211 state: RUNNING_EXECUTION
```
Note at **[INFO] 2022-09-16 07:50:41.431** the master somehow starts to
failover with running task 672 in worker, and in **[INFO] 2022-09-16
07:53:59.168** the task status is SUCCESS but the master submits the task again
in **[INFO] 2022-09-16 07:53:59.173**
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]