Re: [DISCUSS] Table lineage design

boyi Wed, 28 Oct 2020 19:27:57 -0700

Hi,

Is there a concrete plan for splitting the JSON data?

If nothing is planned for the next two to three months, I think this
feature can go ahead.

We could start with a single task type, such as SQL, see how it works,
and then consider other task types.

By the way, will lineage be supported at the table level or at the field
level?
--------------------------------------
BoYi Zhang
E-mail: [email protected]
On 10/29/2020 10:18, Hemin Wen <[email protected]> wrote:
I think this feature does not conflict with splitting the workflow JSON,
because the dependencies are maintained separately in a single table.

This feature targets SQL-related nodes, e.g. the SQL node, ETL node, and
Sqoop node. The t_ds_node_lineage table is designed for extension rather
than for SQL only; SQL is just one source of dependencies.

https://github.com/apache/incubator-dolphinscheduler/issues/249
This issue shows that the demand is real and that many people need this
feature. Maintaining dependencies across large batches of nodes is
currently a pain point for DS, and SQL-related dependencies make up a
relatively large share of them.

--------------------
DolphinScheduler (Incubator) Committer
Hemin Wen  温合民
[email protected]
--------------------


wu shaoj <[email protected]> wrote on Thu, Oct 29, 2020 at 9:14 AM:

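Not a good idea.
This feels too complex; let's wait until the big JSON has been split.
Also, there is currently only the SQL node, which is not suitable for
parsing dependencies; only SQLScript would be! Configuring dependencies
is fairly troublesome, and creating them automatically is not a high
priority at this stage.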

From: Hemin Wen <[email protected]>
Date: Wednesday, October 28, 2020 at 11:49
To: dev <[email protected]>
Subject: [DISCUSS] Table lineage design
Hi!

I would like to propose automatic dependency configuration based on table
lineage. Everyone is welcome to discuss the ideas below.

## 1. Requirement background

Currently, DS can only set up workflow/node dependencies by drawing the
DAG, or by calling the API to create the workflow and its dependencies
based on the workflow's data structure. A data warehouse is generally
designed in layers, data is produced along a chain, there are complex
dependencies between layers, and there are many SQL scripts. Creating
dependencies manually makes large numbers of workflows hard to maintain,
and dependency configuration errors are hard to troubleshoot.

Table lineage can be extracted by parsing the SQL statements in
SQL-related nodes, and dependencies can then be established automatically
from that lineage. The Master Server executes the workflow according to
the supplemented dependencies, ensuring that nodes run in dependency
order.
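
As an illustration of the idea above, here is a minimal Python sketch that
pulls input and output tables out of SQL text and derives node
dependencies from them. The function names and the regular expressions are
assumptions made for this example, not part of the proposal; the regexes
only handle simple INSERT/CREATE ... SELECT statements, and a real
implementation would need a proper SQL parser.

```python
import re

# Hypothetical, deliberately naive patterns: they ignore CTEs, subquery
# aliases, comments and quoted identifiers.
_OUTPUT_RE = re.compile(
    r"\b(?:insert\s+(?:into|overwrite)(?:\s+table)?"
    r"|create\s+table(?:\s+if\s+not\s+exists)?)\s+([\w.]+)",
    re.IGNORECASE,
)
_INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql):
    """Return (input_tables, output_tables) found in one SQL text."""
    outputs = {name.lower() for name in _OUTPUT_RE.findall(sql)}
    inputs = {name.lower() for name in _INPUT_RE.findall(sql)} - outputs
    return inputs, outputs


def build_dependencies(node_sql):
    """Map each node to the nodes that write one of its input tables."""
    lineage = {node: extract_lineage(sql) for node, sql in node_sql.items()}

    # Index: table name -> nodes that produce it.
    producers = {}
    for node, (_, outputs) in lineage.items():
        for table in outputs:
            producers.setdefault(table, set()).add(node)

    deps = {}
    for node, (inputs, _) in lineage.items():
        upstream = set()
        for table in inputs:
            upstream |= producers.get(table, set())
        upstream.discard(node)  # a node never depends on itself
        deps[node] = upstream
    return deps
```

For example, a node that reads `dwd.orders` would come out as depending on
whichever node inserts into `dwd.orders`, without anyone drawing that edge
in the DAG by hand.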

## 2. Design Ideas

- Parse the table lineage of the SQL when a workflow is saved, and
automatically generate the dependency configuration data (SQL-related
nodes only)
- The Master Server automatically resolves dependencies for each node,
generates the dependent nodes, and executes all node tasks
- The front-end node configuration page adds an "Automatically resolve
dependencies" switch that controls whether dependency detection is
enabled when the node executes
- A dependency graph page is added to the front end for viewing node
dependencies after automatic resolution
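
One way to picture how the per-node switch and the resolved dependencies
could combine into an execution order is the sketch below. It is only an
assumption about the behavior, not the Master Server's actual logic: the
function name and the three dictionaries are hypothetical, and the
ordering uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def execution_order(manual_deps, auto_deps, auto_resolve_enabled):
    """Merge hand-drawn and lineage-derived dependencies, then order nodes.

    manual_deps / auto_deps: node -> set of upstream nodes.
    auto_resolve_enabled: node -> bool, the hypothetical per-node switch.
    """
    merged = {}
    for node in set(manual_deps) | set(auto_deps):
        upstream = set(manual_deps.get(node, ()))
        # Lineage-derived edges only apply when the node's switch is on.
        if auto_resolve_enabled.get(node, False):
            upstream |= set(auto_deps.get(node, ()))
        merged[node] = upstream
    # static_order() raises graphlib.CycleError if the edges form a cycle.
    return list(TopologicalSorter(merged).static_order())
```

With the switch off for a node, only its manually drawn edges constrain
the order; turning it on adds the edges derived from table lineage.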

Limitations:

- In the current design, the default rule for automatically generated
dependent nodes only supports checking whether the node's task for the
current day succeeded. The check is fixed to run every N minutes, M times
in total; once that number is exceeded, the dependency is treated as
failed.
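
The fixed rule described above (check every N minutes, M times in total)
can be sketched as a small polling loop. This is an illustrative sketch
only: `wait_for_upstream`, `is_successful` and the injectable `sleep`
parameter are names invented for this example, not DS APIs.

```python
import time


def wait_for_upstream(is_successful, interval_minutes, max_checks,
                      sleep=time.sleep):
    """Poll an upstream status check up to max_checks times.

    is_successful: callable returning True once today's upstream task
    instance has succeeded (hypothetical hook).
    Returns True on success, False once all checks are used up.
    """
    for attempt in range(max_checks):
        if is_successful():
            return True
        # No need to wait after the final check.
        if attempt < max_checks - 1:
            sleep(interval_minutes * 60)
    return False
```

Passing `sleep` as a parameter keeps the loop testable without actually
waiting; a caller would treat a `False` result as a failed dependency.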

## 3. Sequence diagram

Please refer to the attached pictures:
[cid:ii_kgsus5mg0]
[cid:ii_kgsusdlj1]

## 4. Table Design

Add a node lineage table: t_ds_node_lineage

| Column Name           | Description                      |
| --------------------- | -------------------------------- |
| id                    | Auto-incrementing ID             |
| process_definition_id | Workflow definition ID           |
| process_node_id       | Workflow node ID                 |
| lineage_type          | Lineage type (1 input, 2 output) |
| lineage_union_key     | Unique lineage key               |
| create_time           | Creation time                    |


--------------------
DolphinScheduler (Incubator) Committer
Hemin Wen  温合民
[email protected]
--------------------
