JadenQ opened a new issue, #14428: URL: https://github.com/apache/dolphinscheduler/issues/14428
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

### Brief

w2#task1 depends on w1#task1 and has timeout failure turned on (e.g. timeout after 1 min). When w1#task1 has not been executed (outside the dependent interval), w2#task1 fails because of the timeout, but the failure is not reported to w2, so w2 stays stuck in *running*. This is not desirable, since w2 is not even aware that task1 failed.

The timeout failure somehow does not change the taskStateEvent, so in TaskStateEventHandler this situation falls into an error condition, which is logged by:

`
if (task.getState().isFinished()
        && (taskStateEvent.getStatus() != null && taskStateEvent.getStatus().isRunning())) {
    String errorMessage = String.format(
            "The current task instance state is %s, but the task state event status is %s, so the task state event will be ignored",
            task.getState().name(),
            taskStateEvent.getStatus().name());
    log.warn(errorMessage);
    throw new StateEventHandleError(errorMessage);
}
`

w2 does not respond and stays stuck in running. If not handled, this could lead to further misleading situations.

### Replay

1. Create a workflow w1 with an echo shell task w1#task1.
2. Create a workflow w2 with a task w2#task1 that depends on w1#task1, and an echo shell task w2#task2 following w2#task1.
3. Set the retry times of w2#task1 to 0, turn on timeout failure (UI label: 超时失败), and make w2#task1 depend on w1#task1 with the dependent relation AND.
4. Start w2#task1 directly.

### Version related

From 3.0.x to dev; the relevant code has not been modified.

### What you expected to happen

w2 should know that task1 has failed, and workflowExecuteRunnable.taskFinished should be called. That is, if w2#task1 fails, w2 should fail *instead of* staying in the running state.

The workflow should know that a task has failed and use workflowExecuteRunnable.taskFinished to handle further operations, such as task retry, failover, and workflow status change.

### How to reproduce

Same as the Replay steps above.

### Anything else

[INFO] 2023-06-21 13:47:50.588 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-0][TaskInstance-15] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=TaskExecutionStatus{code=1, desc='running'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:47:50.669 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[287] - [WorkflowInstance-15][TaskInstance-21] - Begin to handle state event, TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=TaskExecutionStatus{code=1, desc='running'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:47:50.669 +0800 org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler:[57] - [WorkflowInstance-15][TaskInstance-21] - Handle task instance state event, the current task instance state TaskExecutionStatus{code=1, desc='running'} will be changed to TaskExecutionStatus{code=1, desc='running'}
[INFO] 2023-06-21 13:47:55.588 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-0][TaskInstance-15] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=TaskExecutionStatus{code=1, desc='running'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:47:55.681 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[287] - [WorkflowInstance-15][TaskInstance-21] - Begin to handle state event, TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=TaskExecutionStatus{code=1, desc='running'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:47:55.681 +0800 org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler:[57] - [WorkflowInstance-15][TaskInstance-21] - Handle task instance state event, the current task instance state TaskExecutionStatus{code=1, desc='running'} will be changed to TaskExecutionStatus{code=1, desc='running'}
[INFO] 2023-06-21 13:47:58.957 +0800 org.apache.dolphinscheduler.server.master.task.MasterHeartBeatTask:[70] - [WorkflowInstance-0][TaskInstance-0] - Success write master heartBeatInfo into registry, masterRegistryPath: /nodes/master/172.22.21.6:5678, heartBeatInfo: {"startupTime":1687325447045,"reportTime":1687326478929,"cpuUsage":0.0,"memoryUsage":0.41,"loadAverage":0.0,"availablePhysicalMemorySize":9.06,"maxCpuloadAvg":32.0,"reservedMemory":0.3,"diskAvailable":1002.83,"processId":3160}
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[281] - [WorkflowInstance-15][TaskInstance-15] - Task instance is timeout, adding task timeout event and remove the check
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-15][TaskInstance-15] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=null, type=TASK_TIMEOUT, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[281] - [WorkflowInstance-15][TaskInstance-15] - Task instance is timeout, adding task timeout event and remove the check
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-15][TaskInstance-15] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=null, type=TASK_TIMEOUT, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-0][TaskInstance-15] - Submit state event success, stateEvent: TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=TaskExecutionStatus{code=1, desc='running'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[157] - [WorkflowInstance-15][TaskInstance-15] - Workflow instance 15 timeout, adding timeout event
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[97] - [WorkflowInstance-15][TaskInstance-15] - Submit state event success, stateEvent: WorkflowStateEvent(processInstanceId=15, taskInstanceId=null, status=null, type=PROCESS_TIMEOUT, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:00.589 +0800 org.apache.dolphinscheduler.server.master.runner.StateWheelExecuteThread:[160] - [WorkflowInstance-15][TaskInstance-15] - Workflow instance timeout, added timeout event
[INFO] 2023-06-21 13:48:00.593 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[287] - [WorkflowInstance-15][TaskInstance-21] - Begin to handle state event, TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=null, type=TASK_TIMEOUT, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:00.593 +0800 org.apache.dolphinscheduler.server.master.event.TaskTimeoutStateEventHandler:[53] - [WorkflowInstance-15][TaskInstance-21] - Handle task instance state timout event, taskInstanceId: 21
[INFO] 2023-06-21 13:48:00.594 +0800 TaskLogLogger-class org.apache.dolphinscheduler.server.master.runner.task.DependentTaskProcessor:[151] - [WorkflowInstance-15][TaskInstance-21] - dependent taskInstanceId: 21 timeout, taskName: w2, strategy: warnfailed
[INFO] 2023-06-21 13:48:00.652 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[287] - [WorkflowInstance-15][TaskInstance-21] - Begin to handle state event, TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=null, type=TASK_TIMEOUT, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:00.652 +0800 org.apache.dolphinscheduler.server.master.event.TaskTimeoutStateEventHandler:[53] - [WorkflowInstance-15][TaskInstance-21] - Handle task instance state timout event, taskInstanceId: 21
[INFO] 2023-06-21 13:48:00.682 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[287] - [WorkflowInstance-15][TaskInstance-21] - Begin to handle state event, TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=TaskExecutionStatus{code=1, desc='running'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:00.682 +0800 org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler:[57] - [WorkflowInstance-15][TaskInstance-21] - Handle task instance state event, the current task instance state TaskExecutionStatus{code=6, desc='failure'} will be changed to TaskExecutionStatus{code=1, desc='running'}
[WARN] 2023-06-21 13:48:00.682 +0800 org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler:[68] - [WorkflowInstance-15][TaskInstance-21] - The current task instance state is TaskExecutionStatus{code=6, desc='failure'}, but the task state event status is TaskExecutionStatus{code=1, desc='running'}, so the task state event will be ignored
[ERROR] 2023-06-21 13:48:00.682 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[292] - [WorkflowInstance-15][TaskInstance-21] - State event handle error, will remove this event: TaskStateEvent(processInstanceId=15, taskInstanceId=21, taskCode=9960682788640, status=TaskExecutionStatus{code=1, desc='running'}, type=TASK_STATE_CHANGE, key=null, channel=null, context=null)
org.apache.dolphinscheduler.server.master.event.StateEventHandleError: The current task instance state is TaskExecutionStatus{code=6, desc='failure'}, but the task state event status is TaskExecutionStatus{code=1, desc='running'}, so the task state event will be ignored
    at org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler.handleStateEvent(TaskStateEventHandler.java:69)
    at org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.handleEvents(WorkflowExecuteRunnable.java:288)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
[INFO] 2023-06-21 13:48:01.683 +0800 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[287] - [WorkflowInstance-15][TaskInstance-null] - Begin to handle state event, WorkflowStateEvent(processInstanceId=15, taskInstanceId=null, status=null, type=PROCESS_TIMEOUT, key=null, channel=null, context=null)
[INFO] 2023-06-21 13:48:01.683 +0800 org.apache.dolphinscheduler.server.master.event.WorkflowTimeoutStateEventHandler:[35] - [WorkflowInstance-15][TaskInstance-null] - Handle workflow instance timeout event
[INFO] 2023-06-21 13:48:08.976 +0800 org.apache.dolphinscheduler.server.master.task.MasterHeartBeatTask:[70] - [WorkflowInstance-0][TaskInstance-0] - Success write master heartBeatInfo into registry, masterRegistryPath: /nodes/master/172.22.21.6:5678, heartBeatInfo: {"startupTime":1687325447045,"reportTime":1687326488957,"cpuUsage":0.0,"memoryUsage":0.41,"loadAverage":0.0,"availablePhysicalMemorySize":9.05,"maxCpuloadAvg":32.0,"reservedMemory":0.3,"diskAvailable":1002.83,"processId":3160}
[INFO] 2023-06-21 13:48:14.507 +0800 org.apache.dolphinscheduler.service.log.LoggerRequestProcessor:[79] - [WorkflowInstance-0][TaskInstance-0] - received command : Command [type=ROLL_VIEW_LOG_REQUEST, opaque=1252, bodyLen=132]
[INFO] 2023-06-21 13:48:18.994 +0800 org.apache.dolphinscheduler.server.master.task.MasterHeartBeatTask:[70] - [WorkflowInstance-0][TaskInstance-0] - Success write master heartBeatInfo into registry, masterRegistryPath: /nodes/master/172.22.21.6:5678, heartBeatInfo: {"startupTime":1687325447045,"reportTime":1687326498976,"cpuUsage":0.0,"memoryUsage":0.42,"loadAverage":0.0,"availablePhysicalMemorySize":9.05,"maxCpuloadAvg":32.0,"reservedMemory":0.3,"diskAvailable":1002.83,"processId":3160}

### Version

dev

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
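
For illustration, the event sequence in the logs above (a RUNNING state event, then TASK_TIMEOUT, then a stale RUNNING state event) can be reduced to a toy model. This is a minimal sketch and not DolphinScheduler code: the class `TimeoutFailureSketch`, its `Workflow`, `Event`, `handle`, and `replay` members, and the `notifyOnTimeout` switch are all hypothetical names introduced here. It assumes, as the report suggests, that the timeout path marks the task as failed without notifying the workflow; the real cause may differ.

```java
public class TimeoutFailureSketch {

    // Hypothetical, simplified stand-in for the task execution status.
    enum Status {
        RUNNING, FAILURE;

        boolean isFinished() { return this == FAILURE; }
        boolean isRunning() { return this == RUNNING; }
    }

    enum EventType { TASK_STATE_CHANGE, TASK_TIMEOUT }

    static final class Event {
        final EventType type;
        final Status status; // null for timeout events, as in the logs

        Event(EventType type, Status status) {
            this.type = type;
            this.status = status;
        }
    }

    static final class Workflow {
        Status taskState = Status.RUNNING;
        Status workflowState = Status.RUNNING;

        // Toy version of "the workflow learns that a task finished".
        void taskFinished() {
            if (taskState == Status.FAILURE) {
                workflowState = Status.FAILURE;
            }
        }
    }

    static void handle(Workflow wf, Event event, boolean notifyOnTimeout) {
        switch (event.type) {
            case TASK_TIMEOUT:
                // Assumption: the timeout path marks the task as failed.
                wf.taskState = Status.FAILURE;
                if (notifyOnTimeout) {
                    wf.taskFinished(); // the behavior the report asks for
                }
                break;
            case TASK_STATE_CHANGE:
                // Same shape as the guard quoted in the report: a RUNNING
                // event for an already finished task is dropped.
                if (wf.taskState.isFinished()
                        && event.status != null && event.status.isRunning()) {
                    return;
                }
                if (event.status != null) {
                    wf.taskState = event.status;
                    if (wf.taskState.isFinished()) {
                        wf.taskFinished();
                    }
                }
                break;
        }
    }

    // Replays the three events seen in the logs; returns the workflow state name.
    public static String replay(boolean notifyOnTimeout) {
        Workflow wf = new Workflow();
        handle(wf, new Event(EventType.TASK_STATE_CHANGE, Status.RUNNING), notifyOnTimeout);
        handle(wf, new Event(EventType.TASK_TIMEOUT, null), notifyOnTimeout);
        handle(wf, new Event(EventType.TASK_STATE_CHANGE, Status.RUNNING), notifyOnTimeout);
        return wf.workflowState.name();
    }

    public static void main(String[] args) {
        System.out.println("without notification: " + replay(false)); // RUNNING
        System.out.println("with notification:    " + replay(true));  // FAILURE
    }
}
```

In this model the workflow only leaves the running state when the timeout path itself reports the failure, because the later RUNNING event is discarded by the guard and never reaches `taskFinished`.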
