This is an automated email from the ASF dual-hosted git repository.
zhongjiajie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/dolphinscheduler-website.git
The following commit(s) were added to refs/heads/master by this push:
new d423ef4 update failover logic (#610)
d423ef4 is described below
commit d423ef4bf7b9a7a6f48cf5bab5d6a100d42122a1
Author: wind <[email protected]>
AuthorDate: Mon Dec 27 12:01:46 2021 +0800
update failover logic (#610)
* failover logic
* fix style
* Update docs/en-us/dev/user_doc/architecture/design.md
Co-authored-by: caishunfeng <[email protected]>
Co-authored-by: Jiajie Zhong <[email protected]>
---
docs/en-us/dev/user_doc/architecture/design.md | 24 +++++++++++++++---------
docs/zh-cn/dev/user_doc/architecture/design.md | 25 ++++++++++++++++---------
img/failover-master.jpg | Bin 0 -> 172842 bytes
img/failover-worker.jpg | Bin 0 -> 111717 bytes
4 files changed, 31 insertions(+), 18 deletions(-)
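
Reviewer note: the start-time check that this documentation change describes ("skips fault-tolerance if after the server start time") might be sketched as follows. This is a minimal illustration only. The class `FailoverCheck` and the method `shouldFailover` are hypothetical names and are not taken from the DolphinScheduler code base; the sketch also assumes both timestamps are known.

```java
import java.time.Instant;

// Hypothetical sketch of the start-time comparison described in design.md.
// Names are illustrative and are NOT DolphinScheduler APIs.
public class FailoverCheck {

    /**
     * Returns true when an instance should be failed over.
     * An instance that started after the server's start-up time was created
     * by the restarted server itself, so it is skipped.
     */
    public static boolean shouldFailover(Instant instanceStart, Instant serverStart) {
        return !instanceStart.isAfter(serverStart);
    }

    public static void main(String[] args) {
        Instant serverStart = Instant.parse("2021-12-27T04:00:00Z");
        // Started one minute before the server came (back) up: fail over.
        System.out.println(shouldFailover(serverStart.minusSeconds(60), serverStart));
        // Started one minute after the server came up: skip.
        System.out.println(shouldFailover(serverStart.plusSeconds(60), serverStart));
    }
}
```

Treating equal timestamps as "fail over" is an arbitrary choice in this sketch; the documentation does not say how that boundary is handled.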
diff --git a/docs/en-us/dev/user_doc/architecture/design.md b/docs/en-us/dev/user_doc/architecture/design.md
index c3db4c9..ad1485b 100644
--- a/docs/en-us/dev/user_doc/architecture/design.md
+++ b/docs/en-us/dev/user_doc/architecture/design.md
@@ -187,22 +187,29 @@ Among them, the Master monitors the directories of other Masters and Workers. If
-- Master fault tolerance flowchart:
+- Master fault tolerance:
- <p align="center">
- <img src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_master.png" alt="Master fault tolerance flowchart" width="40%" />
+<p align="center">
+ <img src="/img/failover-master.jpg" alt="failover-master" width="50%" />
</p>
-After the fault tolerance of ZooKeeper Master is completed, it is re-scheduled by the Scheduler thread in DolphinScheduler, traverses the DAG to find the "running" and "submit successful" tasks, monitors the status of its task instances for the "running" tasks, and "commits successful" tasks It is necessary to determine whether the task queue already exists. If it exists, the status of the task instance is also monitored. If it does not exist, resubmit the task instance.
+Fault tolerance range: From the host perspective, a Master's fault tolerance covers its own host plus the hosts of nodes that no longer exist in the registry. The entire fault tolerance process runs under a lock;
+Fault-tolerant content: The Master performs fault tolerance on both process instances and task instances. Before doing so, it compares the instance's start time with the server's start-up time and skips fault tolerance if the instance started after the server did;
-- Worker fault tolerance flowchart:
+Fault-tolerant post-processing: After the fault tolerance of ZooKeeper Master is completed, it is re-scheduled by the Scheduler thread in DolphinScheduler, traverses the DAG to find the "running" and "submit successful" tasks, monitors the status of its task instances for the "running" tasks, and "commits successful" tasks It is necessary to determine whether the task queue already exists. If it exists, the status of the task instance is also monitored. If it does not exist, resubmit the [...]
- <p align="center">
- <img src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_worker.png" alt="Worker fault tolerance flow chart" width="40%" />
+- Worker fault tolerance:
+
+<p align="center">
+ <img src="/img/failover-worker.jpg" alt="failover-worker" width="50%" />
</p>
-Once the Master Scheduler thread finds that the task instance is in the
"fault-tolerant" state, it takes over the task and resubmits it.
+Fault tolerance range: From the process instance perspective, each Master is responsible only for fault tolerance of its own process instances; it takes the lock only during `handleDeadServer`;
+
+Fault-tolerant content: When a Worker node's remove event occurs, the Master performs fault tolerance on task instances only. Before doing so, it compares the instance's start time with the server's start-up time and skips fault tolerance if the instance started after the server did;
+
+Fault-tolerant post-processing: Once the Master Scheduler thread finds that the task instance is in the "fault-tolerant" state, it takes over the task and resubmits it.
Note: Due to "network jitter", the node may lose its heartbeat with ZooKeeper in a short period of time, and the node's remove event may occur. For this situation, we use the simplest way, that is, once the node and ZooKeeper timeout connection occurs, then directly stop the Master or Worker service.
@@ -328,4 +335,3 @@ public class TaskLogFilter extends Filter<ILoggingEvent> {
### Sum up
From the perspective of scheduling, this article preliminarily introduces the architecture principles and implementation ideas of the big data distributed workflow scheduling system-DolphinScheduler. To be continued
-
diff --git a/docs/zh-cn/dev/user_doc/architecture/design.md b/docs/zh-cn/dev/user_doc/architecture/design.md
index 7345ea8..f5030d3 100644
--- a/docs/zh-cn/dev/user_doc/architecture/design.md
+++ b/docs/zh-cn/dev/user_doc/architecture/design.md
@@ -184,24 +184,31 @@ DolphinScheduler使用ZooKeeper分布式锁来实现同一时刻只有一台Mast
</p>
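Among them, the Master monitors the directories of other Masters and Workers. If a remove event is detected, it performs process instance fault tolerance or task instance fault tolerance according to the specific business logic.
+- Master fault tolerance process:
+<p align="center">
+ <img src="/img/failover-master.jpg" alt="fault tolerance process" width="50%" />
+ </p>
-- Master fault tolerance flowchart:
+Fault tolerance range: From the host perspective, a Master's fault tolerance covers its own host plus the hosts of nodes that no longer exist in the registry; the entire fault tolerance process runs under a lock;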
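- <p align="center">
- <img src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_master.png" alt="Master fault tolerance flowchart" width="40%" />
- </p>
-After ZooKeeper Master fault tolerance is complete, scheduling resumes on the Scheduler thread in DolphinScheduler, which traverses the DAG to find the "running" and "submitted successfully" tasks. For "running" tasks it monitors the status of their task instances. For "submitted successfully" tasks it checks whether the task already exists in the Task Queue: if it does, the task instance status is likewise monitored; if it does not, the task instance is resubmitted.
+Fault-tolerant content: The Master performs fault tolerance on both workflow instances and task instances. Before doing so, it compares the instance's start time with the service node's start-up time and skips fault tolerance for instances that started after the service started;
+Fault-tolerant post-processing: After ZooKeeper Master fault tolerance is complete, scheduling resumes on the Scheduler thread in DolphinScheduler, which traverses the DAG to find the "running" and "submitted successfully" tasks. For "running" tasks it monitors the status of their task instances. For "submitted successfully" tasks it checks whether the task already exists in the Task Queue: if it does, the task instance status is likewise monitored; if it does not, the task instance is resubmitted.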
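-- Worker fault tolerance flowchart:
- <p align="center">
- <img src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_worker.png" alt="Worker fault tolerance flowchart" width="40%" />
+- Worker fault tolerance process:
+
+<p align="center">
+ <img src="/img/failover-worker.jpg" alt="fault tolerance process" width="50%" />
</p>
-Once the Master Scheduler thread finds a task instance in the "needs fault tolerance" state, it takes over the task and resubmits it.
+Fault tolerance range: From the workflow instance perspective, each Master is responsible only for fault tolerance of its own workflow instances; it takes the lock only during `handleDeadServer`;
+
+Fault-tolerant content: When a Worker node's remove event occurs, the Master performs fault tolerance on task instances only. Before doing so, it compares the instance's start time with the service node's start-up time and skips fault tolerance for instances that started after the service started;
+
+Fault-tolerant post-processing: Once the Master Scheduler thread finds a task instance in the "needs fault tolerance" state, it takes over the task and resubmits it.
Note: "Network jitter" may cause a node to lose its heartbeat with ZooKeeper for a short time, which triggers the node's remove event. For this case we take the simplest approach: once a node's connection to ZooKeeper times out, the Master or Worker service is stopped directly.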
diff --git a/img/failover-master.jpg b/img/failover-master.jpg
new file mode 100644
index 0000000..5776781
Binary files /dev/null and b/img/failover-master.jpg differ
diff --git a/img/failover-worker.jpg b/img/failover-worker.jpg
new file mode 100644
index 0000000..71f5936
Binary files /dev/null and b/img/failover-worker.jpg differ