dailidong commented on issue #4754:
URL: 
https://github.com/apache/incubator-dolphinscheduler/issues/4754#issuecomment-778105303


   > When a Master goes down, the other online Masters receive the ZK Master 
node remove event and perform fault tolerance. First, they query all the 
process instances on the dead Master that need fault tolerance, then 
generate fault-tolerance Commands and write them into the command table of the 
database.
   > Next, once a healthy Master obtains the lock, it fetches the 
fault-tolerance Command from the database and starts executing it: 
initialization, DAG construction, and so on. It then finds all the head nodes. 
If a task has already completed, it keeps looking downstream until it finds the 
task nodes of the process instance that are still running.
   > In this way, the new Master takes over the process instance and continues 
to monitor the task status. Once the current task node finishes, it 
continues to dispatch tasks.
   
   Good question and answer.
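
The flow described in the quote can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not DolphinScheduler's actual code: `ProcessInstance`, `Command`, `on_master_removed`, `take_over`, the `"FAULT_TOLERANCE"` command type, and the task state strings are all hypothetical names; a plain list stands in for the database command table and a `threading.Lock` stands in for the lock the Masters compete for. It also assumes that an instance "needs fault tolerance" when at least one of its tasks has not succeeded.

```python
from dataclasses import dataclass
from threading import Lock


@dataclass
class ProcessInstance:
    id: int
    host: str          # Master currently responsible for this instance
    dag: dict          # task name -> list of downstream task names
    task_state: dict   # task name -> "SUCCESS" | "RUNNING" | "PENDING"


@dataclass
class Command:
    type: str
    process_instance_id: int


command_table = []      # hypothetical stand-in for the database command table
command_lock = Lock()   # hypothetical stand-in for the lock Masters compete for


def on_master_removed(dead_host, instances):
    """Called on a surviving Master when the ZK remove event arrives."""
    for inst in instances:
        unfinished = any(s != "SUCCESS" for s in inst.task_state.values())
        if inst.host == dead_host and unfinished:
            command_table.append(Command("FAULT_TOLERANCE", inst.id))


def head_nodes(dag):
    """Head nodes are the tasks that no other task points to."""
    downstream = {t for targets in dag.values() for t in targets}
    return [t for t in dag if t not in downstream]


def find_running_tasks(inst):
    """Walk down from the head nodes, skipping completed tasks."""
    running, seen = [], set()
    stack = head_nodes(inst.dag)
    while stack:
        task = stack.pop()
        if task in seen:
            continue
        seen.add(task)
        state = inst.task_state[task]
        if state == "SUCCESS":
            stack.extend(inst.dag[task])   # completed: keep looking downstream
        elif state == "RUNNING":
            running.append(task)           # this is what the new Master monitors
    return sorted(running)


def take_over(new_host, instances_by_id):
    """A healthy Master grabs the lock, fetches one command, and takes over."""
    with command_lock:
        if not command_table:
            return None
        cmd = command_table.pop(0)
    inst = instances_by_id[cmd.process_instance_id]
    inst.host = new_host
    return inst.id, find_running_tasks(inst)
```

For example, with a DAG `a -> (b, c) -> d` where `a` and `c` have succeeded and `b` is still running, calling `on_master_removed("master-a", ...)` followed by `take_over("master-b", ...)` reassigns the instance to `master-b` and reports `b` as the task to keep monitoring.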


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

