reele commented on issue #17342:
URL: 
https://github.com/apache/dolphinscheduler/issues/17342#issuecomment-3086365872

   
   > We do update the workflow's host, there might exist concurrent issue, it's better to get the workflow's host from executor, once the event ready to report, then get the host from executor.
   > 
   > ```
   > public boolean reassignWorkflowInstanceHost(final TaskExecutorReassignMasterRequest taskExecutorReassignMasterRequest) {
   >     final int taskInstanceId = taskExecutorReassignMasterRequest.getTaskInstanceId();
   >     final String workflowHost = taskExecutorReassignMasterRequest.getWorkflowHost();
   >     // todo: Is this reassign can make sure there is no concurrent problem?
   >     physicalTaskExecutorRepository.get(taskInstanceId).ifPresent(
   >             taskExecutor -> taskExecutor.getTaskExecutionContext().setWorkflowInstanceHost(workflowHost));
   >     return physicalTaskExecutorEventReporter.reassignWorkflowInstanceHost(taskInstanceId, workflowHost);
   > }
   > ```
   
   Oh, I found out why! It is caused by this issue; here are the other logs:
   ```
   [WI-0][TI-0] - 2025-07-10 20:30:54.101 ERROR [MasterCommandHandleThreadPool] o.a.d.s.m.e.c.CommandEngine:[186] - Failed bootstrap command {
     "id" : 4889016,
     "commandType" : "RECOVER_TOLERANCE_FAULT_PROCESS",
     "workflowDefinitionCode" : 15081302155680,
     "workflowDefinitionVersion" : 20,
     "workflowInstanceId" : 4828292,
     "commandParam" : "{\"commandType\":\"RECOVER_TOLERANCE_FAULT\",\"subWorkflowInstance\":false,\"startNodes\":null,\"commandParams\":null,\"timeZone\":null,\"workflowExecutionStatus\":\"RUNNING_EXECUTION\"}",
     "workflowInstancePriority" : "MEDIUM",
     "executorId" : 0,
     "taskDependType" : "TASK_POST",
     "failureStrategy" : "CONTINUE",
     "warningType" : "NONE",
     "warningGroupId" : null,
     "scheduleTime" : null,
     "startTime" : null,
     "updateTime" : "2025-07-10 20:30:53",
     "workerGroup" : null,
     "tenantCode" : "default",
     "environmentCode" : -1,
     "dryRun" : 0
   } 
   java.util.concurrent.CompletionException: java.lang.IllegalStateException: WorkflowExecuteRunnable(4828292/WORKFLOW-A-20250710194500099 already registered at WorkflowEventBusFireWorker
           at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
           at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
           at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:673)
           at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
           at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
           at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1609)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalStateException: WorkflowExecuteRunnable(4828292/WORKFLOW-A-20250710194500099 already registered at WorkflowEventBusFireWorker
           at com.google.common.base.Preconditions.checkState(Preconditions.java:821)
           at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusFireWorker.registerWorkflowEventBus(WorkflowEventBusFireWorker.java:63)
           at org.apache.dolphinscheduler.server.master.engine.WorkflowEventBusCoordinator.registerWorkflowEventBus(WorkflowEventBusCoordinator.java:50)
           at org.apache.dolphinscheduler.server.master.engine.command.CommandEngine.bootstrapWorkflowExecutionRunnable(CommandEngine.java:167)
           at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
           ... 6 common frames omitted
   ```
   
   After master-2.1.21 started, it published the failover command as well.
Coincidentally, master 2.1.20 picked up this command. After calling
`bootstrapCommand` in `CommandEngine`, it failed in
`bootstrapWorkflowExecutionRunnable`. By that point the task executor had
already been reassigned to master 2.1.20 again, and the new
`workflowExecutionRunnable` had already been put into `workflowRepository`, but
`workflowEventBusCoordinator.registerWorkflowEventBus` then failed. As a
result, there is no thread to handle the new `workflowExecutionRunnable`'s
event bus, so all the events got stuck in the executor's channel.
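
To illustrate the ordering problem described above, here is a minimal,
self-contained sketch. It is not the project's code or a proposed patch: the
class `BootstrapRollbackSketch`, its two maps, and the methods
`registerEventBus` and `bootstrap` are hypothetical stand-ins for
`workflowRepository`, the event bus worker's registry, and
`bootstrapWorkflowExecutionRunnable`. It only shows one possible way to avoid
leaving a replaced entry in the repository when the event bus registration
throws, namely restoring the previous entry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins; none of these names are DolphinScheduler classes.
class BootstrapRollbackSketch {

    // Plays the role of workflowRepository: workflowInstanceId -> runnable.
    final Map<Integer, String> repository = new ConcurrentHashMap<>();
    // Plays the role of the event bus worker's registry.
    final Map<Integer, String> eventBusRegistry = new ConcurrentHashMap<>();

    // Mirrors the checkState failure in the log: a second registration throws.
    void registerEventBus(int workflowInstanceId, String runnable) {
        if (eventBusRegistry.putIfAbsent(workflowInstanceId, runnable) != null) {
            throw new IllegalStateException(
                    "WorkflowExecuteRunnable(" + workflowInstanceId + ") already registered");
        }
    }

    // Puts the runnable into the repository, then registers its event bus.
    // If registration fails, the repository entry is restored to what it was.
    boolean bootstrap(int workflowInstanceId, String runnable) {
        final String previous = repository.put(workflowInstanceId, runnable);
        try {
            registerEventBus(workflowInstanceId, runnable);
            return true;
        } catch (IllegalStateException e) {
            if (previous == null) {
                repository.remove(workflowInstanceId, runnable);
            } else {
                repository.replace(workflowInstanceId, runnable, previous);
            }
            return false;
        }
    }

    public static void main(String[] args) {
        BootstrapRollbackSketch master = new BootstrapRollbackSketch();
        System.out.println(master.bootstrap(4828292, "original"));   // true
        System.out.println(master.bootstrap(4828292, "duplicate"));  // false
        System.out.println(master.repository.get(4828292));          // original
    }
}
```

With the rollback, the repository still points at the runnable whose event bus
is actually registered. The sketch does not cover the task executor side: the
reassignment of the workflow host happens elsewhere and would need its own
handling.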


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
