[I] [Bug] [Master] Network exception occurred between Master and ZooKeeper, triggering failover mechanism, which caused duplicate task execution on the next node [dolphinscheduler]

via GitHub Thu, 31 Oct 2024 23:46:44 -0700


1105560808 opened a new issue, #16759:
URL: https://github.com/apache/dolphinscheduler/issues/16759


   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What happened
   
   A network problem caused a Master to lose its connection to ZooKeeper, which 
triggered the failover mechanism. The original Master process kept running, 
however, with tasks still executing and their successor nodes waiting in memory. 
Meanwhile, other Master nodes detected the failure and regenerated the task DAG. 
When the in-flight node completed, both Masters submitted the next node at the 
same time, so multiple Worker nodes processed the same task. This can leave 
subsequent task states inconsistent.
   
   ### What you expected to happen
   
   After a Master loses its connection to ZooKeeper because of a network problem, 
the same task should not be executed concurrently.
   
   ### How to reproduce
   
   Steps:
   1. Identify a workflow with a long-running node
   2. While that node is executing:
      - Disconnect the Master from ZooKeeper
      - Use the pause strategy (not stop)
      - Trigger Master failover
   3. Wait for the current node to complete
   4. Verify:
      - Check whether subsequent nodes are executed more than once
      - Check that task states remain consistent
   
   ### Anything else
   
   Proposed Solution:
   Before submitting the next node's task, the Master should:
   1. Read the host recorded in processInstance
   2. Compare it with the current Master's host
   3. Exit if the two do not match
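
The three steps above could look roughly like the following. This is only a minimal sketch to illustrate the idea, not DolphinScheduler code: `FailoverGuard`, `ProcessInstanceRecord`, `ownsProcessInstance`, and `submitNextTaskIfOwner` are hypothetical names, and the sketch assumes the record passed in was freshly re-read from the database rather than taken from the Master's in-memory copy.

```java
/**
 * Sketch of the ownership check proposed above. All names here are
 * hypothetical; none of them are DolphinScheduler classes or methods.
 */
final class FailoverGuard {

    /** Hypothetical stand-in for a process instance row freshly read from the database. */
    public static final class ProcessInstanceRecord {
        public final int id;
        public final String host; // address of the Master recorded as owner of the instance

        public ProcessInstanceRecord(int id, String host) {
            this.id = id;
            this.host = host;
        }
    }

    private FailoverGuard() {
    }

    /** Steps 1 and 2: read the recorded host and compare it with this Master's host. */
    public static boolean ownsProcessInstance(ProcessInstanceRecord latest, String currentMasterHost) {
        return latest != null
                && latest.host != null
                && latest.host.equals(currentMasterHost);
    }

    /**
     * Step 3: run the submission only when this Master is still the recorded owner.
     * Returns true if the task was submitted, false if the Master backed off.
     */
    public static boolean submitNextTaskIfOwner(ProcessInstanceRecord latest,
                                                String currentMasterHost,
                                                Runnable submitNextTask) {
        if (!ownsProcessInstance(latest, currentMasterHost)) {
            return false; // another Master has taken over; do not submit
        }
        submitNextTask.run();
        return true;
    }
}
```

Note that a check followed by a separate submit is not atomic: a failover that lands between the comparison and the submission would still slip through, so a real fix would probably need to make the two steps a single guarded operation or narrow that window in some other way.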
   
   ### Version
   
   3.2.x
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


