reele opened a new issue, #17355: URL: https://github.com/apache/dolphinscheduler/issues/17355
### Search before asking - [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues. ### What happened the eventChannel in TaskExecutorLifecycleEventRemoteReporter will be removed when it is empty: ```java public void receiveTaskExecutorLifecycleEventACK(final TaskExecutorLifecycleEventAck eventAck) { final int taskExecutorId = eventAck.getTaskExecutorId(); eventChannelsLock.lock(); try { final ReportableTaskExecutorLifecycleEventChannel eventChannel = eventChannels.get(taskExecutorId); if (eventChannel == null) { return; } final IReportableTaskExecutorLifecycleEvent removed = eventChannel.remove(eventAck.getTaskExecutorLifecycleEventType()); if (removed != null) { log.info("Success removed {} by ack: {}", removed, eventAck); } else { log.info("Failed removed ReportableTaskExecutorLifecycleEvent by ack: {}", eventAck); } %%%%%%%%%% here the channel was removed %%%%%%%%%% if (eventChannel.isEmpty()) { eventChannels.remove(taskExecutorId); log.debug("Removed ReportableTaskExecutorLifecycleEventChannel: {}", taskExecutorId); } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% taskExecutionEventEmptyCondition.signalAll(); } finally { eventChannelsLock.unlock(); } } ``` so if there is no event waiting for report, reassignWorkflowInstanceHost will return false. ```java public boolean reassignWorkflowInstanceHost(int taskInstanceId, String workflowHost) { eventChannelsLock.lock(); try { %%%%%%%%%% here cannot get channel if there is no event %%%%%%%%%% final ReportableTaskExecutorLifecycleEventChannel eventChannel = eventChannels.get(taskInstanceId); if (eventChannel == null) { return false; } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% eventChannel.taskExecutionEventsQueue.forEach(event -> event.setWorkflowInstanceHost(workflowHost)); return true; } finally { eventChannelsLock.unlock(); } } ``` so if master restart during the task's execution, the master failover will failed to take-over the old task-instance, and then create a new task-instance and `TaskExecutionRunnable`, so now if the old task finished, the event will stuck in eventChannel, because the old TaskExecutionRunnable is not exists: ``` [WI-0][TI-28459975] - 2025-07-22 16:39:21.465 ERROR [PhysicalTaskExecutorLifecycleEventReporter] o.a.d.e.m.TaskExecutorEventRemoteReporterClient:[68] - Report ITaskExecutorLifecycleEvent: TaskExecutorSuccessLifecycleEvent(super=AbstractTaskExecutorLifecycleEvent(super=AbstractDelayEvent(delayTime=0, createTimeInNano=544710287716565, expiredTimeInNano=544710287716816), taskInstanceId=28459975, eventCreateTime=1753173561446, type=SUCCESS), workflowInstanceId=4842042, workflowInstanceHost=10.0.6.23:5678, taskInstanceHost=10.0.6.23:1234, endTime=1753173561446, varPool=[], latestReportTime=1753173561455) to master failed org.apache.dolphinscheduler.extract.base.exception.MethodInvocationException: Cannot find the TaskExecuteRunnable: 28459975 at org.apache.dolphinscheduler.extract.base.exception.MethodInvocationException.of(MethodInvocationException.java:27) at org.apache.dolphinscheduler.extract.base.client.SyncClientMethodInvoker.invoke(SyncClientMethodInvoker.java:53) at org.apache.dolphinscheduler.extract.base.client.ClientInvocationHandler.invoke(ClientInvocationHandler.java:56) at com.sun.proxy.$Proxy152.onTaskExecutorSuccess(Unknown Source) at org.apache.dolphinscheduler.extract.master.TaskExecutorEventRemoteReporterClient.reportTaskSuccessEventToMaster(TaskExecutorEventRemoteReporterClient.java:118) at org.apache.dolphinscheduler.extract.master.TaskExecutorEventRemoteReporterClient.reportTaskExecutionEventToMaster(TaskExecutorEventRemoteReporterClient.java:61) at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorLifecycleEventRemoteReporter.handleTaskExecutionEventChannel(TaskExecutorLifecycleEventRemoteReporter.java:180) at org.apache.dolphinscheduler.task.executor.eventbus.TaskExecutorLifecycleEventRemoteReporter.run(TaskExecutorLifecycleEventRemoteReporter.java:84) ``` ### What you expected to happen - ### How to reproduce - ### Anything else _No response_ ### Version dev ### Are you willing to submit PR? - [x] Yes I am willing to submit a PR! ### Code of Conduct - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
