This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/dolphinscheduler-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 6354dab Automated deployment: d423ef4bf7b9a7a6f48cf5bab5d6a100d42122a1
6354dab is described below
commit 6354dab8d4dbf5072c8f4998444ba15ca6eec3e1
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Mon Dec 27 04:05:58 2021 +0000
Automated deployment: d423ef4bf7b9a7a6f48cf5bab5d6a100d42122a1
---
en-us/docs/dev/user_doc/architecture/design.html | 21 ++++++++++++---------
en-us/docs/dev/user_doc/architecture/design.json | 2 +-
img/failover-master.jpg | Bin 0 -> 172842 bytes
img/failover-worker.jpg | Bin 0 -> 111717 bytes
zh-cn/docs/dev/user_doc/architecture/design.html | 20 ++++++++++++--------
zh-cn/docs/dev/user_doc/architecture/design.json | 2 +-
6 files changed, 26 insertions(+), 19 deletions(-)
diff --git a/en-us/docs/dev/user_doc/architecture/design.html
b/en-us/docs/dev/user_doc/architecture/design.html
index 9a7586f..c4a2419 100644
--- a/en-us/docs/dev/user_doc/architecture/design.html
+++ b/en-us/docs/dev/user_doc/architecture/design.html
@@ -180,19 +180,23 @@ In the above figure, MainFlowThread waits for the end of
SubFlowThread1, SubFlow
</p>
Among them, the Master monitors the directories of other Masters and Workers.
If a remove event is detected, fault tolerance of the process instance or task
instance is performed according to the specific business logic.
<ul>
-<li>Master fault tolerance flowchart:</li>
+<li>Master fault tolerance:</li>
</ul>
- <p align="center">
- <img
src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_master.png"
alt="Master fault tolerance flowchart" width="40%" />
+<p align="center">
+ <img src="/img/failover-master.jpg" alt="failover-master" width="50%" />
</p>
-After the fault tolerance of ZooKeeper Master is completed, it is re-scheduled
by the Scheduler thread in DolphinScheduler, traverses the DAG to find the
"running" and "submit successful" tasks, monitors the status of its task
instances for the "running" tasks, and "commits successful" tasks It is
necessary to determine whether the task queue already exists. If it exists, the
status of the task instance is also monitored. If it does not exist, resubmit
the task instance.
+<p>Fault tolerance range: From the host perspective, the Master's fault
tolerance covers its own host plus node hosts that do not exist in the
registry; the entire fault tolerance process runs under a lock.</p>
+<p>Fault-tolerant content: The Master performs fault tolerance on both
process instances and task instances. Before doing so, it compares the start
time of each instance with the server start-up time, and skips fault tolerance
for instances that started after the server start time.</p>
+<p>Fault-tolerant post-processing: After ZooKeeper Master fault tolerance
completes, scheduling is taken over again by the Scheduler thread in
DolphinScheduler, which traverses the DAG to find the &quot;running&quot; and
&quot;submit successful&quot; tasks. For &quot;running&quot; tasks, it monitors
the status of their task instances. For &quot;submit successful&quot; tasks, it
must determine whether the task already exists in the task queue. If it exists,
the status of the task instance is also mo [...]
<ul>
-<li>Worker fault tolerance flowchart:</li>
+<li>Worker fault tolerance:</li>
</ul>
- <p align="center">
- <img
src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_worker.png"
alt="Worker fault tolerance flow chart" width="40%" />
+<p align="center">
+ <img src="/img/failover-worker.jpg" alt="failover-worker" width="50%" />
</p>
-<p>Once the Master Scheduler thread finds that the task instance is in the
"fault-tolerant" state, it takes over the task and resubmits it.</p>
+<p>Fault tolerance range: From the process instance perspective, each
Master is responsible only for fault tolerance of its own process instances; it
acquires a lock only during <code>handleDeadServer</code>.</p>
+<p>Fault-tolerant content: When a remove event is sent for a Worker node,
the Master performs fault tolerance on task instances only. Before doing so, it
compares the start time of each instance with the server start-up time, and
skips fault tolerance for instances that started after the server start time.</p>
+<p>Fault-tolerant post-processing: Once the Master Scheduler thread finds that
the task instance is in the "fault-tolerant" state, it takes over the
task and resubmits it.</p>
<p>Note: Due to &quot;network jitter&quot;, a node may briefly lose its heartbeat
with ZooKeeper, which can trigger the node's remove event. We handle this
situation in the simplest way: once the connection between a node and
ZooKeeper times out, the Master or Worker service is stopped directly.</p>
<h6>2.Task failure retry</h6>
<p>Here we must first distinguish the concepts of task failure retry, process
failure recovery, and process failure rerun:</p>
@@ -325,7 +329,6 @@ public class TaskLogFilter extends
Filter<ILoggingEvent> {
### Sum up
From the perspective of scheduling, this article preliminarily introduces the
architecture principles and implementation ideas of the big data distributed
workflow scheduling system-DolphinScheduler. To be continued
-
</code></pre>
</div></section><footer class="footer-container"><div
class="footer-body"><div><h3>About us</h3><h4>Do you need feedback? Please
contact us through the following ways.</h4></div><div
class="contact-container"><ul><li><a
href="/en-us/community/development/subscribe.html"><img class="img-base"
src="/img/emailgray.png"/><img class="img-change"
src="/img/emailblue.png"/><p>Email List</p></a></li><li><a
href="https://twitter.com/dolphinschedule"><img class="img-base"
src="/img/twittergray.png [...]
<script
src="//cdn.jsdelivr.net/npm/[email protected]/dist/react-with-addons.min.js"></script>
diff --git a/en-us/docs/dev/user_doc/architecture/design.json
b/en-us/docs/dev/user_doc/architecture/design.json
index 5e7b231..2d53d95 100644
--- a/en-us/docs/dev/user_doc/architecture/design.json
+++ b/en-us/docs/dev/user_doc/architecture/design.json
@@ -1,6 +1,6 @@
{
"filename": "design.md",
- "__html": "<h2>System Architecture Design</h2>\n<p>Before explaining the
architecture of the scheduling system, let's first understand the commonly used
terms of the scheduling
system</p>\n<h3>1.Glossary</h3>\n<p><strong>DAG:</strong> The full name is
Directed Acyclic Graph, referred to as DAG. Task tasks in the workflow are
assembled in the form of a directed acyclic graph, and topological traversal is
performed from nodes with zero degrees of entry until there are no subsequent
nodes [...]
+ "__html": "<h2>System Architecture Design</h2>\n<p>Before explaining the
architecture of the scheduling system, let's first understand the commonly used
terms of the scheduling
system</p>\n<h3>1.Glossary</h3>\n<p><strong>DAG:</strong> The full name is
Directed Acyclic Graph, referred to as DAG. Task tasks in the workflow are
assembled in the form of a directed acyclic graph, and topological traversal is
performed from nodes with zero degrees of entry until there are no subsequent
nodes [...]
"link": "/dist/en-us/docs/dev/user_doc/architecture/design.html",
"meta": {}
}
\ No newline at end of file
diff --git a/img/failover-master.jpg b/img/failover-master.jpg
new file mode 100644
index 0000000..5776781
Binary files /dev/null and b/img/failover-master.jpg differ
diff --git a/img/failover-worker.jpg b/img/failover-worker.jpg
new file mode 100644
index 0000000..71f5936
Binary files /dev/null and b/img/failover-worker.jpg differ
diff --git a/zh-cn/docs/dev/user_doc/architecture/design.html
b/zh-cn/docs/dev/user_doc/architecture/design.html
index 3dba759..50bf6f5 100644
--- a/zh-cn/docs/dev/user_doc/architecture/design.html
+++ b/zh-cn/docs/dev/user_doc/architecture/design.html
@@ -181,19 +181,23 @@ Server基于netty提供监听服务。Worker</p>
</p>
其中Master监控其他Master和Worker的目录,如果监听到remove事件,则会根据具体的业务逻辑进行流程实例容错或者任务实例容错。
<ul>
-<li>Master容错流程图:</li>
+<li>Master容错流程:</li>
</ul>
- <p align="center">
- <img
src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_master.png"
alt="Master容错流程图" width="40%" />
+<p align="center">
+ <img src="/img/failover-master.jpg" alt="容错流程" width="50%" />
</p>
-ZooKeeper Master容错完成之后则重新由DolphinScheduler中Scheduler线程调度,遍历 DAG
找到”正在运行”和“提交成功”的任务,对”正在运行”的任务监控其任务实例的状态,对”提交成功”的任务需要判断Task
Queue中是否已经存在,如果存在则同样监控任务实例的状态,如果不存在则重新提交任务实例。
+<p>容错范围:从host的维度来看,Master的容错范围包括:自身host+注册中心上不存在的节点host,容错的整个过程会加锁;</p>
+<p>容错内容:Master的容错内容包括:容错工作流实例和任务实例,在容错前会比较实例的开始时间和服务节点的启动时间,在服务启动时间之后的则跳过容错;</p>
+<p>容错后处理:ZooKeeper Master容错完成之后则重新由DolphinScheduler中Scheduler线程调度,遍历 DAG
找到”正在运行”和“提交成功”的任务,对”正在运行”的任务监控其任务实例的状态,对”提交成功”的任务需要判断Task
Queue中是否已经存在,如果存在则同样监控任务实例的状态,如果不存在则重新提交任务实例。</p>
<ul>
-<li>Worker容错流程图:</li>
+<li>Worker容错流程:</li>
</ul>
- <p align="center">
- <img
src="https://analysys.github.io/easyscheduler_docs_cn/images/fault-tolerant_worker.png"
alt="Worker容错流程图" width="40%" />
+<p align="center">
+ <img src="/img/failover-worker.jpg" alt="容错流程" width="50%" />
</p>
-<p>Master Scheduler线程一旦发现任务实例为” 需要容错”状态,则接管任务并进行重新提交。</p>
+<p>容错范围:从工作流实例的维度看,每个Master只负责容错自己的工作流实例;只有在<code>handleDeadServer</code>时会加锁;</p>
+<p>容错内容:当发送Worker节点的remove事件时,Master只容错任务实例,在容错前会比较实例的开始时间和服务节点的启动时间,在服务启动时间之后的则跳过容错;</p>
+<p>容错后处理:Master Scheduler线程一旦发现任务实例为” 需要容错”状态,则接管任务并进行重新提交。</p>
<p>注意:由于”
网络抖动”可能会使得节点短时间内失去和ZooKeeper的心跳,从而发生节点的remove事件。对于这种情况,我们使用最简单的方式,那就是节点一旦和ZooKeeper发生超时连接,则直接将Master或Worker服务停掉。</p>
<h6>2.任务失败重试</h6>
<p>这里首先要区分任务失败重试、流程失败恢复、流程失败重跑的概念:</p>
diff --git a/zh-cn/docs/dev/user_doc/architecture/design.json
b/zh-cn/docs/dev/user_doc/architecture/design.json
index 36f2936..3a3984f 100644
--- a/zh-cn/docs/dev/user_doc/architecture/design.json
+++ b/zh-cn/docs/dev/user_doc/architecture/design.json
@@ -1,6 +1,6 @@
{
"filename": "design.md",
- "__html":
"<h2>系统架构设计</h2>\n<p>在对调度系统架构说明之前,我们先来认识一下调度系统常用的名词</p>\n<h3>1.名词解释</h3>\n<p><strong>DAG:</strong>
全称Directed Acyclic
Graph,简称DAG。工作流中的Task任务以有向无环图的形式组装起来,从入度为零的节点进行拓扑遍历,直到无后继节点为止。举例如下图:</p>\n<p
align=\"center\">\n <img src=\"/img/dag_examples_cn.jpg\" alt=\"dag示例\"
width=\"60%\" />\n <p align=\"center\">\n <em>dag示例</em>\n
</p>\n</p>\n<p><strong>流程定义</strong>:通过拖拽任务节点并建立任务节点的关联所形成的可视化<strong>DAG</strong></p>\n<p><strong>流程实例</strong>:流程实例是流程定义的实例化,可以通过手动启动或定时调度生成,
[...]
+ "__html":
"<h2>系统架构设计</h2>\n<p>在对调度系统架构说明之前,我们先来认识一下调度系统常用的名词</p>\n<h3>1.名词解释</h3>\n<p><strong>DAG:</strong>
全称Directed Acyclic
Graph,简称DAG。工作流中的Task任务以有向无环图的形式组装起来,从入度为零的节点进行拓扑遍历,直到无后继节点为止。举例如下图:</p>\n<p
align=\"center\">\n <img src=\"/img/dag_examples_cn.jpg\" alt=\"dag示例\"
width=\"60%\" />\n <p align=\"center\">\n <em>dag示例</em>\n
</p>\n</p>\n<p><strong>流程定义</strong>:通过拖拽任务节点并建立任务节点的关联所形成的可视化<strong>DAG</strong></p>\n<p><strong>流程实例</strong>:流程实例是流程定义的实例化,可以通过手动启动或定时调度生成,
[...]
"link": "/dist/zh-cn/docs/dev/user_doc/architecture/design.html",
"meta": {}
}
\ No newline at end of file
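
The fault-tolerant content paragraphs added in this commit describe one shared rule for Master and Worker failover: compare each instance's start time with the server start-up time, and skip instances that started after it. A minimal Java sketch of that rule follows. The class `FailoverSketch`, the `Instance` type, and both method names are hypothetical and are not taken from the DolphinScheduler code base. Treating equal timestamps as "not after" (so the instance is still failed over) is an assumption, since the text does not cover that case.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: every name here is hypothetical, not DolphinScheduler's API.
final class FailoverSketch {

    /** Minimal stand-in for a process instance or task instance. */
    static final class Instance {
        final String name;
        final Instant startTime;

        Instance(String name, Instant startTime) {
            this.name = name;
            this.startTime = startTime;
        }
    }

    /**
     * Returns true when fault tolerance should be skipped, i.e. the instance
     * started strictly after the server start-up time.
     */
    static boolean shouldSkipFailover(Instant instanceStart, Instant serverStart) {
        return instanceStart.isAfter(serverStart);
    }

    /** Returns the names of the instances that still need fault tolerance. */
    static List<String> selectForFailover(List<Instance> instances, Instant serverStart) {
        List<String> selected = new ArrayList<>();
        for (Instance instance : instances) {
            if (!shouldSkipFailover(instance.startTime, serverStart)) {
                selected.add(instance.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Instant serverStart = Instant.parse("2021-12-27T04:00:00Z");
        List<Instance> instances = Arrays.asList(
                new Instance("started-before", Instant.parse("2021-12-27T03:55:00Z")),
                new Instance("started-after", Instant.parse("2021-12-27T04:05:00Z")));
        // Only the instance that started before the server start-up time is selected.
        System.out.println(selectForFailover(instances, serverStart));
    }
}
```

The sketch leaves out the locking mentioned in the fault tolerance range paragraphs and any interaction with the registry; it only isolates the time comparison.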