asadjan4611 opened a new issue, #17990:
URL: https://github.com/apache/dolphinscheduler/issues/17990

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   In master task dispatching, WorkerGroupDispatcher.run() has no outer 
try-catch around the main while loop.
   
   If any unexpected exception escapes from workerGroupEventBus.take() or from 
task context access before inner dispatch handling, the dispatcher thread 
exits. After that, tasks for that worker group are no longer dispatched, 
causing a partial scheduling outage until process restart or thread recreation.
   
   Related code paths:
   - WorkerGroupDispatcher.run()
   - TaskDispatchableEventBus.take() (uses @SneakyThrows, so checked exceptions 
can be rethrown as unchecked)
   
   ### What you expected to happen
   
   WorkerGroupDispatcher should remain alive even if one iteration fails 
unexpectedly.
   
   Expected behavior:
   1. Catch any unexpected exception at loop boundary.
   2. Log error with worker-group context.
   3. Continue next iteration (or controlled recovery/restart strategy).
   4. Handle interruption explicitly and exit gracefully only when stopping.
   
   ### How to reproduce
   
   Deterministic local reproduction (fault-injection / test style):
   
   1. Start master and ensure WorkerGroupDispatcher thread is running.
   2. Trigger an unexpected exception from dispatcher loop boundary (for 
example from event-bus take path or malformed task event path).
   3. Observe dispatcher thread exits.
   4. Submit new tasks to the same worker group.
   5. Observe tasks remain undispatched while master process is still up.
   
   Code evidence:
   - WorkerGroupDispatcher.run() loop has no outer catch.
   - TaskDispatchableEventBus.take() is annotated with @SneakyThrows, so 
take-time exceptions can escape as unchecked.
   
   ### Anything else
   
   Potential impact:
   - Worker-group-level scheduling freeze
   - Silent throughput drop and task backlog growth
   - Hard-to-diagnose production incidents because master stays alive but one 
dispatcher thread is dead
   
   Relevant snippets:
   
   ```
   WorkerGroupDispatcher.run():
   while (runningFlag.get()) {
       TaskDispatchableEvent<ITaskExecutionRunnable> taskEntry = 
workerGroupEventBus.take();
       ITaskExecutionRunnable taskExecutionRunnable = taskEntry.getData();
       try (TaskExecutorMDCUtils.MDCAutoClosable ignore = 
TaskExecutorMDCUtils.logWithMDC(taskExecutionRunnable.getId())) {
           
LogUtils.setWorkflowInstanceIdMDC(taskExecutionRunnable.getTaskInstance().getWorkflowInstanceId());
           doDispatchTask(taskExecutionRunnable);
       } finally {
           LogUtils.removeWorkflowInstanceIdMDC();
       }
   }
   
   TaskDispatchableEventBus.take():
   @SneakyThrows
   public V take() {
       return super.take();
   }
   ```
   
   Suggested direction:
   - Add top-level try-catch in run loop.
   - Treat InterruptedException separately (restore interrupt flag and exit 
only when stopping).
   - Add health metric/log for dispatcher thread liveness.
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
[email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to