reele opened a new issue, #17342:
URL: https://github.com/apache/dolphinscheduler/issues/17342

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   When the master cluster is restarted, workflow instances that already belong to an alive master server may be taken over again by a newly started master instance.
   
   The log segments below show the sequence: when `master-server 172.2.1.20` started, it first took over `master-server 172.2.1.21`'s workflows (WORKFLOW-A, B, C). Then `master-server 172.2.1.21` started and, following the logic of `getFailoverWorkflowsForMaster`, compared the workflows' start_time with `master-server 172.2.1.20`'s startup time. Because those workflows had just been taken over by `master-server 172.2.1.20` (so their start_time predates its startup time), `172.2.1.21` created commands to fail them over again.
   
   ```
   ====== master-server 172.2.1.20 ======
   
   [WI-0][TI-0] - 2025-07-10 20:30:49.341 INFO  [Master-Server] 
o.a.d.s.m.e.s.SystemEventBus:[40] - Published SystemEvent: 
GlobalMasterFailoverEvent{eventTime=Thu Jul 10 20:30:48 GMT+08:00 2025}
   [WI-0][TI-0] - 2025-07-10 20:30:49.341 INFO  [Master-Server] 
o.a.d.s.m.e.s.SystemEventBusFireWorker:[62] - SystemEventBusFireWorker started
   [WI-0][TI-0] - 2025-07-10 20:30:49.346 INFO  [Master-Server] 
o.a.d.s.m.MasterServer:[164] - MasterServer initialized successfully in 1343 ms
   [WI-0][TI-0] - 2025-07-10 20:30:49.346 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[73] - Global master failover starting
   [WI-0][TI-0] - 2025-07-10 20:30:49.490 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[90] - The master[172.2.1.21:5678] is not 
alive, do global master failover on it
   [WI-0][TI-0] - 2025-07-10 20:30:49.818 INFO  [Curator-TreeCache-0] 
o.a.d.s.m.c.AbstractClusterSubscribeListener:[41] - Server 
WorkerServerMetadata(workerGroup=default, workerWeight=100.0, 
taskThreadPoolUsage=0.0) added
   [WI-0][TI-0] - 2025-07-10 20:30:50.117 INFO  [Curator-TreeCache-0] 
o.a.d.s.m.c.AbstractClusterSubscribeListener:[41] - Server 
WorkerServerMetadata(workerGroup=default, workerWeight=100.0, 
taskThreadPoolUsage=0.0) added
   [WI-0][TI-0] - 2025-07-10 20:30:50.779 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.WorkflowFailover:[64] - Success failover workflowInstance: 
[id=4826511, name=WORKFLOW-C-3-20250710050000600, state=RUNNING_EXECUTION]
   [WI-0][TI-0] - 2025-07-10 20:30:50.787 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.WorkflowFailover:[64] - Success failover workflowInstance: 
[id=4826537, name=WORKFLOW-B-5-20250710050001109, state=RUNNING_EXECUTION]
   [WI-0][TI-0] - 2025-07-10 20:30:50.793 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.WorkflowFailover:[64] - Success failover workflowInstance: 
[id=4828292, name=WORKFLOW-A-20250710194500099, state=RUNNING_EXECUTION]
   [WI-0][TI-0] - 2025-07-10 20:30:50.795 INFO  [SystemEventBusFireWorker] 
o.a.d.r.a.RegistryClient:[177] - persist key: 
/nodes/failover-finish-nodes/172.2.1.21:5678-unknown-unknown, value: 
1752150648003
   [WI-0][TI-0] - 2025-07-10 20:30:50.803 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[155] - Master[172.2.1.21:5678] failover 3 
workflows finished, cost: 1312/ms
   [WI-0][TI-0] - 2025-07-10 20:30:50.803 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[99] - Global master failover finished, cost: 
1457/ms
   [WI-0][TI-0] - 2025-07-10 20:30:50.803 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.e.s.SystemEventBusFireWorker:[103] - Fire SystemEvent: 
GlobalMasterFailoverEvent{eventTime=Thu Jul 10 20:30:48 GMT+08:00 2025} cost: 
1459 ms
   [WI-0][TI-0] - 2025-07-10 20:30:51.011 INFO  [MasterCommandHandleThreadPool] 
o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: 
WorkflowStartLifecycleEvent{workflow=WORKFLOW-A-20250710194500099}
   [WI-0][TI-0] - 2025-07-10 20:30:51.012 INFO  [MasterCommandHandleThreadPool] 
o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: 
WorkflowStartLifecycleEvent{workflow=WORKFLOW-B-5-20250710050001109}
   [WI-0][TI-0] - 2025-07-10 20:30:51.012 INFO  [MasterCommandHandleThreadPool] 
o.a.d.s.m.e.WorkflowEventBus:[41] - Publish event: 
WorkflowStartLifecycleEvent{workflow=WORKFLOW-C-3-20250710050000600}
   [WI-0][TI-0] - 2025-07-10 20:30:51.018 INFO  [MasterCommandHandleThreadPool] 
o.a.d.s.m.e.c.CommandEngine:[174] - Success bootstrap command {
   ...
   
   
   ====== master-server 172.2.1.21 ======
   
   [WI-0][TI-0] - 2025-07-10 20:30:53.603 INFO  [Master-Server] 
o.a.d.s.m.e.s.SystemEventBus:[40] - Published SystemEvent: 
GlobalMasterFailoverEvent{eventTime=Thu Jul 10 20:30:51 GMT+08:00 2025}
   [WI-0][TI-0] - 2025-07-10 20:30:53.603 INFO  [Master-Server] 
o.a.d.s.m.e.s.SystemEventBusFireWorker:[62] - SystemEventBusFireWorker started
   [WI-0][TI-0] - 2025-07-10 20:30:53.609 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[73] - Global master failover starting
   [WI-0][TI-0] - 2025-07-10 20:30:53.612 INFO  [Master-Server] 
o.a.d.s.m.MasterServer:[164] - MasterServer initialized successfully in 1670 ms
   [WI-0][TI-0] - 2025-07-10 20:30:53.720 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[82] - The 
master[MasterServerMetadata(super=BaseServerMetadata(processId=9652, 
serverStartupTime=1752150648003, address=172.2.1.20:5678, 
cpuUsage=0.02604212364461751, memoryUsage=0.33144767254423296, 
serverStatus=NORMAL))] is alive, do global master failover on it
   [WI-0][TI-0] - 2025-07-10 20:30:53.927 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.WorkflowFailover:[64] - Success failover workflowInstance: 
[id=4826511, name=WORKFLOW-C-3-20250710050000600, state=RUNNING_EXECUTION]
   [WI-0][TI-0] - 2025-07-10 20:30:53.933 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.WorkflowFailover:[64] - Success failover workflowInstance: 
[id=4826537, name=WORKFLOW-B-5-20250710050001109, state=RUNNING_EXECUTION]
   [WI-0][TI-0] - 2025-07-10 20:30:53.940 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.WorkflowFailover:[64] - Success failover workflowInstance: 
[id=4828292, name=WORKFLOW-A-20250710194500099, state=RUNNING_EXECUTION]
   [WI-0][TI-0] - 2025-07-10 20:30:53.942 INFO  [SystemEventBusFireWorker] 
o.a.d.r.a.RegistryClient:[177] - persist key: 
/nodes/failover-finish-nodes/172.2.1.20:5678-unknown-unknown, value: 
1752150648003
   [WI-0][TI-0] - 2025-07-10 20:30:53.953 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[155] - Master[172.2.1.20:5678] failover 3 
workflows finished, cost: 231/ms
   [WI-0][TI-0] - 2025-07-10 20:30:53.953 INFO  [SystemEventBusFireWorker] 
o.a.d.s.m.f.FailoverCoordinator:[82] - The 
master[MasterServerMetadata(super=BaseServerMetadata(processId=28662, 
serverStartupTime=1752150651942, address=172.2.1.21:5678, 
cpuUsage=0.025943546541755118, memoryUsage=0.3181573051280973, 
serverStatus=NORMAL))] is alive, do global master failover on it
   ```
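   The race can be sketched as follows. This is a hypothetical, simplified model of the selection logic in `getFailoverWorkflowsForMaster` — the record type, method signatures, and field names are illustrative assumptions, not the actual DolphinScheduler implementation:

   ```java
   import java.time.Instant;
   import java.util.List;
   import java.util.stream.Collectors;

   public class FailoverRaceSketch {

       // Hypothetical stand-in for a workflow instance row (id, bound master host, start time).
       record WorkflowInstance(int id, String host, Instant startTime) {}

       /**
        * Simplified version of the assumed selection rule: any workflow bound to
        * the target master whose start time predates that master's current
        * startup time is treated as needing failover.
        */
       static List<WorkflowInstance> getFailoverWorkflows(
               List<WorkflowInstance> all, String masterAddress, Instant masterStartupTime) {
           return all.stream()
                   .filter(w -> w.host().equals(masterAddress))
                   .filter(w -> w.startTime().isBefore(masterStartupTime))
                   .collect(Collectors.toList());
       }

       public static void main(String[] args) {
           Instant masterStartup = Instant.parse("2025-07-10T12:30:48Z");

           // 172.2.1.20 already failed over a workflow and re-bound it to itself,
           // but the instance's original start time is still older than 172.2.1.20's
           // startup time.
           List<WorkflowInstance> instances = List.of(
                   new WorkflowInstance(4826511, "172.2.1.20:5678",
                           masterStartup.minusSeconds(3600)));

           // 172.2.1.21 restarts, sees 172.2.1.20 alive, and still selects the
           // already-recovered instance because startTime < 172.2.1.20's startupTime.
           List<WorkflowInstance> again =
                   getFailoverWorkflows(instances, "172.2.1.20:5678", masterStartup);
           System.out.println(again.size());
       }
   }
   ```

   Under this rule, the just-recovered instance is selected a second time, which matches the duplicate `Success failover workflowInstance` lines in both logs above.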
   
   So I think a host-timestamp (host address + host startup time) is needed to mark the workflow instance (e.g. stored in `workflowInstance.host`), and the takeover decision should be based on that marker instead of start_time. This approach may be more stable and simpler.
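   A minimal sketch of the proposed marker, assuming the host field is extended to carry the owning master's startup time — the encoding, record, and helper names here are illustrative assumptions, not existing DolphinScheduler APIs:

   ```java
   import java.util.Map;
   import java.util.Objects;

   public class HostTimestampSketch {

       /** Proposed marker: master address plus that master incarnation's startup time. */
       record HostTimestamp(String address, long startupTime) {
           // Hypothetical encoding, e.g. "172.2.1.20:5678#1752150648003".
           static HostTimestamp parse(String encoded) {
               int idx = encoded.lastIndexOf('#');
               return new HostTimestamp(encoded.substring(0, idx),
                       Long.parseLong(encoded.substring(idx + 1)));
           }
           String encode() { return address + "#" + startupTime; }
       }

       /**
        * A workflow needs takeover only when the (address, startupTime) recorded
        * in workflowInstance.host does not match any currently alive master
        * incarnation. A restarted master has a new startupTime, so instances it
        * has just taken over are no longer re-failed-over by latecomers.
        */
       static boolean needsTakeover(String workflowHost, Map<String, Long> aliveMasters) {
           HostTimestamp ht = HostTimestamp.parse(workflowHost);
           return !Objects.equals(aliveMasters.get(ht.address()), ht.startupTime());
       }

       public static void main(String[] args) {
           // Alive masters with their registered startup times (from the registry).
           Map<String, Long> alive = Map.of("172.2.1.20:5678", 1752150648003L);

           // Already taken over by the current incarnation of .20: leave it alone.
           System.out.println(needsTakeover("172.2.1.20:5678#1752150648003", alive));

           // Still marked with a dead incarnation (older startup time): take over.
           System.out.println(needsTakeover("172.2.1.20:5678#1752150000000", alive));
       }
   }
   ```

   With this check, the second master in the logs above would see that WORKFLOW-A/B/C already carry `172.2.1.20`'s live (address, startupTime) pair and skip them.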
   
   ### What you expected to happen
   
   -
   
   ### How to reproduce
   
   -
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   dev
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

