Gallardot commented on PR #15270:
URL: 
https://github.com/apache/dolphinscheduler/pull/15270#issuecomment-1888523291

   ### This is an analysis of a bug related to the serial wait strategy, which 
causes the workflow instance to remain in a waiting state indefinitely.
   
   When a workflow uses the SERIAL_WAIT strategy, a workflow instance whose 
status is WAITING can remain in the waiting state indefinitely, even after the 
previous workflow instance has completed execution.
   
   **The problem is intermittent: it occurs only with a certain probability.**
   
   The cause is as follows. The `MasterSchedulerBootstrap` thread processes 
commands through the `handleCommand` method, which runs inside a transaction. 
Within that transaction, the `saveSerialProcess` method modifies the status of 
the workflow instance. At the same time, in the separate thread pool that runs 
`WorkflowExecuteRunnable`, the `checkSerialProcess` method checks the status of 
workflow instances in order to wake up an instance that is in a waiting state.
   
   This normally works, but there is a **specific situation** in which it 
fails: one workflow instance is about to complete while another is being 
created. Because of transaction isolation, the `saveSerialProcess` call in 
`handleCommand` may have just executed but **not yet been committed**. At that 
moment, `checkSerialProcess` cannot see that the new workflow instance's status 
is WAITING, so it does not wake it. The new instance then remains in the 
waiting state and is never awakened.
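
   The interleaving described above can be replayed with a small toy model. 
This is only a sketch, not DolphinScheduler code: the class `SerialWaitRace`, 
its methods, and the `Status` values are hypothetical names, and the "database" 
is just two maps that mimic the rule that uncommitted writes are invisible to 
other readers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy model (hypothetical, not DolphinScheduler code): writes made inside an
// open transaction are invisible to other readers until commit() is called.
class SerialWaitRace {
    enum Status { RUNNING, WAITING, FINISHED }

    // Rows that every reader can see.
    private final Map<Integer, Status> committed = new LinkedHashMap<>();
    // Rows written by the still-open handleCommand transaction.
    private final Map<Integer, Status> pending = new LinkedHashMap<>();

    void writeCommitted(int id, Status s) { committed.put(id, s); }
    void writeInOpenTransaction(int id, Status s) { pending.put(id, s); }
    void commit() { committed.putAll(pending); pending.clear(); }
    Status statusOf(int id) { return committed.get(id); }

    // Stand-in for checkSerialProcess: wake the first committed WAITING instance.
    Optional<Integer> wakeNextWaiting() {
        for (Map.Entry<Integer, Status> e : committed.entrySet()) {
            if (e.getValue() == Status.WAITING) {
                e.setValue(Status.RUNNING);
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    // Replays the problematic ordering step by step.
    static SerialWaitRace replayRace() {
        SerialWaitRace db = new SerialWaitRace();
        db.writeCommitted(1, Status.RUNNING);         // instance 1 is running
        db.writeInOpenTransaction(2, Status.WAITING); // saveSerialProcess, uncommitted
        db.writeCommitted(1, Status.FINISHED);        // instance 1 completes
        db.wakeNextWaiting();                         // check finds no WAITING row
        db.commit();                                  // handleCommand commits too late
        return db;
    }

    public static void main(String[] args) {
        // Prints: instance 2 after race: WAITING
        System.out.println("instance 2 after race: " + replayRace().statusOf(2));
    }
}
```

   In this replay the only wake-up attempt happens before the commit, so 
instance 2 ends up committed as WAITING with nothing left to wake it.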
   
   My solution is to have the workflow instance status update that is made 
within `handleCommand` run in its own new transaction, so that it commits 
independently of the `handleCommand` transaction. This avoids the problem 
above. I have been running this in my environment for two months, and the 
problem has not recurred.
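
   The effect of moving the status update into its own transaction can be 
sketched with another toy model. Again this is an illustration under 
assumptions, not the code from the PR: `RequiresNewSketch`, `Tx`, and the 
method names are hypothetical. In Spring, this behaviour is commonly obtained 
with `@Transactional(propagation = Propagation.REQUIRES_NEW)`; the comment does 
not state which mechanism the PR uses.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical): compares writing the WAITING status inside the
// outer transaction with writing it in a separate, immediately committed one.
class RequiresNewSketch {
    private final Map<Integer, String> committed = new HashMap<>();

    // A transaction buffers its writes until commit() publishes them.
    final class Tx {
        private final Map<Integer, String> writes = new HashMap<>();
        void put(int id, String status) { writes.put(id, status); }
        void commit() { committed.putAll(writes); writes.clear(); }
    }

    Tx begin() { return new Tx(); }

    // What a concurrent checker thread would be able to see.
    boolean visibleAsWaiting(int id) { return "WAITING".equals(committed.get(id)); }

    // Returns whether the WAITING status is visible to a checker while the
    // outer (handleCommand-like) transaction is still open.
    static boolean waitingVisibleDuringHandleCommand(boolean separateTx) {
        RequiresNewSketch db = new RequiresNewSketch();
        Tx outer = db.begin();               // outer transaction starts
        if (separateTx) {
            Tx inner = db.begin();           // new, independent transaction
            inner.put(2, "WAITING");
            inner.commit();                  // published immediately
        } else {
            outer.put(2, "WAITING");         // hidden until outer commits
        }
        boolean visible = db.visibleAsWaiting(2);
        outer.commit();
        return visible;
    }

    public static void main(String[] args) {
        // Prints: same transaction: false
        System.out.println("same transaction: " + waitingVisibleDuringHandleCommand(false));
        // Prints: separate transaction: true
        System.out.println("separate transaction: " + waitingVisibleDuringHandleCommand(true));
    }
}
```

   One trade-off of this design is that a status update committed in its own 
transaction is not rolled back if the outer transaction later fails.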
   
   
https://github.com/apache/dolphinscheduler/blob/0f7081be10b657184d2eef316c8a2cafcf2ce343/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/process/ProcessServiceImpl.java#L291-L316
   
   
https://github.com/apache/dolphinscheduler/blob/bd48c991783b2e0ea0c602f6ef6c9a09c92e7b42/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/process/ProcessServiceImpl.java#L326-L342
   
   
https://github.com/apache/dolphinscheduler/blob/bd48c991783b2e0ea0c602f6ef6c9a09c92e7b42/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/WorkflowExecuteRunnable.java#L790-L832
   
   
   
   
   @ruanwenjun @Radeity @EricGao888 @SbloodyS @fuchanghai @qingwli @caishunfeng 
 PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

