crazychengmm opened a new issue, #17884:
URL: https://github.com/apache/dolphinscheduler/issues/17884

   ### Search before asking
   
   - [x] I have searched the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   Describe the bug
   In a multi-master environment (DolphinScheduler 3.2.0), when a workflow 
containing a SubProcess task is executed, the Master logs report a 
NullPointerException in TaskStateEventHandler.
   
   The TaskStateEvent is broken because the taskCode is 0 and the status is 
null. This prevents the parent workflow from progressing, and the Master falls 
into an infinite retry loop for this event.
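
To illustrate the failure mode described above, here is a minimal sketch of how an event with `status=null` can trigger a NullPointerException, and how a guard could reject it instead. This is not the actual `TaskStateEventHandler` code, and we do not know what line 56 of that class does. `BrokenEventGuard`, `Event`, `Status`, and all method names are hypothetical stand-ins that carry only the fields visible in the log.

```java
import java.util.Optional;

class BrokenEventGuard {

    // Hypothetical stand-in for the task status enum.
    enum Status { RUNNING, SUCCESS, FAILURE }

    // Hypothetical, simplified event holding only fields seen in the log.
    static final class Event {
        final int processInstanceId;
        final int taskInstanceId;
        final long taskCode;
        final Status status; // may be null, as in the reported log

        Event(int processInstanceId, int taskInstanceId, long taskCode, Status status) {
            this.processInstanceId = processInstanceId;
            this.taskInstanceId = taskInstanceId;
            this.taskCode = taskCode;
            this.status = status;
        }
    }

    // Returns a description of the first problem found, or empty if the event looks usable.
    static Optional<String> findProblem(Event e) {
        if (e.taskCode == 0L) {
            return Optional.of("taskCode is 0");
        }
        if (e.status == null) {
            return Optional.of("status is null");
        }
        return Optional.empty();
    }

    // Unguarded: dereferencing a null status throws NullPointerException.
    static boolean isFinishedUnguarded(Event e) {
        return e.status.ordinal() >= Status.SUCCESS.ordinal();
    }

    // Guarded: a broken event is reported as "not finished" instead of throwing.
    static boolean isFinishedGuarded(Event e) {
        if (findProblem(e).isPresent()) {
            return false;
        }
        return isFinishedUnguarded(e);
    }

    public static void main(String[] args) {
        // Values taken from the log in this report.
        Event broken = new Event(8249, 2445040, 0L, null);
        System.out.println("problem: " + findProblem(broken).orElse("none"));
        try {
            isFinishedUnguarded(broken);
        } catch (NullPointerException ex) {
            System.out.println("unguarded handler threw NullPointerException");
        }
        System.out.println("guarded result: " + isFinishedGuarded(broken));
    }
}
```

A guard like this would only stop the retry loop. It would not explain why the event arrives without a valid taskCode and status in the first place.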
   
   Reproducibility
   
   100% Reproducible: This issue happens every time we run a workflow with a 
SubProcess in a multi-master setup.
   Single Master Test: When we scale down to a single Master node, the issue 
disappears completely, and the same workflow finishes successfully. This 
suggests a synchronization or metadata visibility issue specific to the 
Multi-Master architecture.
   Environment:
   
   - DolphinScheduler Version: 3.2.0
   - OS: Linux
   - Java Version: 1.8.0_202
   - Database: MySQL
   - Deployment Mode: Cluster (Multiple Master Servers)
   
   Log Snippet:
   
   [INFO] 2026-01-15 17:11:38.720 +0800 
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[292] 
- [WorkflowInstance-8249][TaskInstance-2445040] - Begin to handle state event, 
TaskStateEvent(processInstanceId=8249, taskInstanceId=2445040, taskCode=0, 
status=null, type=TASK_STATE_CHANGE, key=8250-0-8249-2445040, channel=null, 
context=null)
   [WARN] 2026-01-15 17:11:38.720 +0800 
org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler:[96] - 
[WorkflowInstance-8249][TaskInstance-2445040] - The task event is broken..., 
taskEvent: TaskStateEvent(processInstanceId=8249, taskInstanceId=2445040, 
taskCode=0, status=null, type=TASK_STATE_CHANGE, key=8250-0-8249-2445040, 
channel=null, context=null)
   [ERROR] 2026-01-15 17:11:38.720 +0800 
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable:[317] 
- [WorkflowInstance-8249][TaskInstance-2445040] - State event handle error, get 
a unknown exception, will retry this event: 
TaskStateEvent(processInstanceId=8249, taskInstanceId=2445040, taskCode=0, 
status=null, type=TASK_STATE_CHANGE, key=8250-0-8249-2445040, channel=null, 
context=null)
   java.lang.NullPointerException: null
           at 
org.apache.dolphinscheduler.server.master.event.TaskStateEventHandler.handleStateEvent(TaskStateEventHandler.java:56)
           at 
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable.handleEvents(WorkflowExecuteRunnable.java:293)
           ...
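
To find every occurrence of this pattern in a Master log, a small scanner along the following lines may help. It is a sketch that assumes each `TaskStateEvent(...)` entry sits on a single line in the real log file (the lines above are wrapped by the mail client) and that the field order matches the snippet. `BrokenEventLogScan` and `isBrokenEventLine` are hypothetical names, not part of DolphinScheduler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BrokenEventLogScan {

    // Captures taskCode (group 3) and status (group 4) from the event's toString output.
    private static final Pattern EVENT = Pattern.compile(
            "TaskStateEvent\\(processInstanceId=(\\d+), taskInstanceId=(\\d+), "
                    + "taskCode=(\\d+), status=(\\w+),");

    // True if the line contains a TaskStateEvent with taskCode=0 or status=null.
    static boolean isBrokenEventLine(String line) {
        Matcher m = EVENT.matcher(line);
        if (!m.find()) {
            return false; // not a TaskStateEvent line
        }
        long taskCode = Long.parseLong(m.group(3));
        String status = m.group(4);
        return taskCode == 0L || "null".equals(status);
    }

    public static void main(String[] args) {
        // Event text copied from the log snippet in this report, joined onto one line.
        String line = "Begin to handle state event, TaskStateEvent(processInstanceId=8249, "
                + "taskInstanceId=2445040, taskCode=0, status=null, type=TASK_STATE_CHANGE, "
                + "key=8250-0-8249-2445040, channel=null, context=null)";
        System.out.println("broken event line: " + isBrokenEventLine(line));
    }
}
```

Feeding each line of the Master log through `isBrokenEventLine` would show whether the broken events are limited to SubProcess task instances or also affect other task types.
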
   Steps to Reproduce:
   
   1. Deploy DolphinScheduler 3.2.0 with 2 or more Master nodes.
   2. Create a SubProcess workflow.
   3. Create a Parent workflow containing a SubProcess node.
   4. Run the Parent workflow.
   5. Once the SubProcess completes, the NPE will be triggered in the Master 
managing the Parent workflow.
   
   Questions:
   
   Is this a known issue in the 3.2.0 release related to event distribution 
between Masters?
   If 3.2.0 is no longer the recommended stable version, could you please 
advise which version (e.g., 3.2.1, 3.2.2, or 3.3.x) contains the fix for this 
specific SubProcess callback issue?
   Expected Behavior:
   In a Multi-Master environment, the Master node should be able to correctly 
reconstruct the TaskStateEvent with the valid taskCode and status when a 
SubProcess completes.
   
   ### What you expected to happen
   
   1. Workaround: Since we are stuck on 3.2.0, is there any configuration 
change or workflow design adjustment (e.g., using different node types) that 
can bypass this NPE in a multi-master setup?
   2. Fixed Version: Which specific version (3.2.1, 3.2.2, or 3.3.x) officially 
fixes this taskCode=0 and status=null issue for SubProcess callbacks? I will 
use that version to verify the fix in our UAT environment.
   
   ### How to reproduce
   
   1. Deploy DolphinScheduler 3.2.0 with 3 Master nodes (HA).
   2. Use MySQL as the database (JDK 1.8.0_202).
   3. Create a workflow with a SubProcess node (pointing to a valid child 
workflow).
   4. Run the parent workflow.
   5. Observe Master logs: the issue happens every time in our setup.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

