asadjan4611 opened a new issue, #17990: URL: https://github.com/apache/dolphinscheduler/issues/17990
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

In master task dispatching, `WorkerGroupDispatcher.run()` has no outer try-catch around the main while loop. If any unexpected exception escapes from `workerGroupEventBus.take()` or from task context access before inner dispatch handling, the dispatcher thread exits. After that, tasks for that worker group are no longer dispatched, causing a partial scheduling outage until process restart or thread recreation.

Related code paths:

- `WorkerGroupDispatcher.run()`
- `TaskDispatchableEventBus.take()` (uses `@SneakyThrows`, so checked exceptions can be rethrown as unchecked)

### What you expected to happen

`WorkerGroupDispatcher` should remain alive even if one iteration fails unexpectedly.

Expected behavior:

1. Catch any unexpected exception at loop boundary.
2. Log error with worker-group context.
3. Continue next iteration (or controlled recovery/restart strategy).
4. Handle interruption explicitly and exit gracefully only when stopping.

### How to reproduce

Deterministic local reproduction (fault-injection / test style):

1. Start master and ensure the `WorkerGroupDispatcher` thread is running.
2. Trigger an unexpected exception from the dispatcher loop boundary (for example from the event-bus take path or a malformed task event path).
3. Observe that the dispatcher thread exits.
4. Submit new tasks to the same worker group.
5. Observe that tasks remain undispatched while the master process is still up.

Code evidence:

- `WorkerGroupDispatcher.run()` loop has no outer catch.
- `TaskDispatchableEventBus.take()` is annotated with `@SneakyThrows`, so take-time exceptions can escape as unchecked.
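The failure mode and the expected behavior can be sketched in isolation, outside DolphinScheduler. The snippet below is only an illustration under stated assumptions: `DispatcherLoopDemo`, its `take` helper, the `STOP` sentinel, and the `guarded` flag are hypothetical names, and a plain `LinkedBlockingQueue<String>` stands in for the real event bus. It is not the project's code and not the proposed patch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class DispatcherLoopDemo {

    // Hypothetical sentinel that ends the demo loop (not part of DolphinScheduler).
    static final String STOP = "STOP";

    // Hypothetical stand-in for the event-bus take path: entries starting
    // with "bad" trigger an injected unchecked exception.
    static String take(BlockingQueue<String> bus) throws InterruptedException {
        String task = bus.take();
        if (task.startsWith("bad")) {
            throw new IllegalStateException("injected fault for " + task);
        }
        return task;
    }

    // Runs one dispatcher thread over the given tasks and returns what it dispatched.
    // guarded=false mimics a loop without an outer catch; guarded=true logs and continues.
    static List<String> run(boolean guarded, List<String> tasks) throws InterruptedException {
        BlockingQueue<String> bus = new LinkedBlockingQueue<>(tasks);
        bus.add(STOP);
        List<String> dispatched = new CopyOnWriteArrayList<>();
        AtomicBoolean running = new AtomicBoolean(true);

        Runnable loop = () -> {
            while (running.get()) {
                try {
                    String task = take(bus);
                    if (STOP.equals(task)) {
                        running.set(false);
                    } else {
                        dispatched.add(task); // stands in for the real dispatch call
                    }
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and leave the loop.
                    Thread.currentThread().interrupt();
                    return;
                } catch (RuntimeException e) {
                    if (!guarded) {
                        throw e; // unguarded: the exception escapes and the thread dies
                    }
                    // guarded: log the failure and move on to the next iteration
                    System.err.println("dispatch iteration failed, continuing: " + e.getMessage());
                }
            }
        };

        Thread dispatcher = new Thread(loop, "worker-group-dispatcher-demo");
        // Swallow the uncaught exception; the demo only cares that the thread exited.
        dispatcher.setUncaughtExceptionHandler((thread, error) -> { });
        dispatcher.start();
        dispatcher.join();
        return dispatched;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> tasks = Arrays.asList("task-1", "bad-task", "task-2");
        System.out.println("unguarded dispatched: " + run(false, tasks));
        System.out.println("guarded dispatched:   " + run(true, tasks));
    }
}
```

In the unguarded run the thread exits at `bad-task`, so only `task-1` is dispatched and `task-2` stays in the queue. In the guarded run the failure is logged and `task-2` is still dispatched. A real fix would presumably also log the worker group name, which this sketch leaves out.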
### Anything else

Potential impact:

- Worker-group-level scheduling freeze
- Silent throughput drop and task backlog growth
- Hard-to-diagnose production incidents because master stays alive but one dispatcher thread is dead

Relevant snippets:

```
// WorkerGroupDispatcher.run():
while (runningFlag.get()) {
    TaskDispatchableEvent<ITaskExecutionRunnable> taskEntry = workerGroupEventBus.take();
    ITaskExecutionRunnable taskExecutionRunnable = taskEntry.getData();
    try (
            TaskExecutorMDCUtils.MDCAutoClosable ignore =
                    TaskExecutorMDCUtils.logWithMDC(taskExecutionRunnable.getId())) {
        LogUtils.setWorkflowInstanceIdMDC(taskExecutionRunnable.getTaskInstance().getWorkflowInstanceId());
        doDispatchTask(taskExecutionRunnable);
    } finally {
        LogUtils.removeWorkflowInstanceIdMDC();
    }
}

// TaskDispatchableEventBus.take():
@SneakyThrows
public V take() {
    return super.take();
}
```

Suggested direction:

- Add top-level try-catch in run loop.
- Treat `InterruptedException` separately (restore interrupt flag and exit only when stopping).
- Add health metric/log for dispatcher thread liveness.

### Version

dev

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
