1105560808 opened a new issue, #16759: URL: https://github.com/apache/dolphinscheduler/issues/16759
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

Due to network issues, a Master lost its connection to ZooKeeper, which triggered the failover mechanism. However, the original Master kept running: it still had tasks in execution and the next nodes waiting in memory. Meanwhile, the other Master nodes detected the failure and regenerated the task DAG. When the previous node completed, both Masters submitted the next node at the same time, so multiple Worker nodes processed the same task. This may lead to inconsistent task state in subsequent nodes.

### What you expected to happen

After a Master loses its connection to ZooKeeper due to network issues, the same task should not be executed concurrently.

### How to reproduce

Steps:

1. Identify a workflow with a long-running node
2. During node execution:
   - Disconnect the Master from ZooKeeper
   - Use the pause strategy (not stop)
   - Trigger Master failover
3. Wait for the current node to complete
4. Verify:
   - Check for duplicate execution of subsequent nodes
   - Monitor task state consistency

### Anything else

Proposed Solution:

Before submitting the next node's task, the Master should:

1. Verify the host recorded in processInstance
2. Compare it with the current Master's host
3. Exit if a mismatch is detected

### Version

3.2.x

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
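
The three-step check from the Proposed Solution in the issue above can be sketched roughly as follows. This is only an illustration under stated assumptions: `MasterOwnershipGuard`, `ProcessInstanceRecord`, and `ownsInstance` are hypothetical names, not DolphinScheduler classes or methods, and the record is assumed to be freshly reloaded from the database immediately before the next node is submitted.

```java
import java.util.Objects;

// Hypothetical sketch: none of these names are DolphinScheduler's actual API.
public class MasterOwnershipGuard {

    /** Minimal stand-in for the persisted workflow instance row (assumed shape). */
    public static final class ProcessInstanceRecord {
        private final int id;
        private final String host; // "ip:port" of the Master that currently owns the instance

        public ProcessInstanceRecord(int id, String host) {
            this.id = id;
            this.host = host;
        }

        public int getId() { return id; }

        public String getHost() { return host; }
    }

    private final String localMasterAddress;

    public MasterOwnershipGuard(String localMasterAddress) {
        this.localMasterAddress = Objects.requireNonNull(localMasterAddress);
    }

    /**
     * Steps 1 and 2: read the host stored on the freshly loaded record and
     * compare it with this Master's own address. A missing record counts as
     * "not owned".
     */
    public boolean ownsInstance(ProcessInstanceRecord freshRecord) {
        return freshRecord != null && localMasterAddress.equals(freshRecord.getHost());
    }

    public static void main(String[] args) {
        MasterOwnershipGuard guard = new MasterOwnershipGuard("10.0.0.1:5678");

        // Before failover the record still names this Master.
        ProcessInstanceRecord before = new ProcessInstanceRecord(42, "10.0.0.1:5678");
        // After failover another Master has taken over the same instance.
        ProcessInstanceRecord afterFailover = new ProcessInstanceRecord(42, "10.0.0.2:5678");

        // Step 3: the caller would skip submission and exit when this is false.
        System.out.println(guard.ownsInstance(before));        // true
        System.out.println(guard.ownsInstance(afterFailover)); // false
    }
}
```

A read-then-submit check like this still leaves a short window between the comparison and the actual submission, so it narrows the duplicate-execution case without necessarily eliminating it.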
