Re: [DISCUSS] Table lineage design

Hemin Wen Mon, 16 Nov 2020 02:28:28 -0800

To address the limitation noted in the design, I propose the following improvement:
    The front-end node configuration page adds a "Define Dependency
Cycle" setting (dependency cycle, number of retries, and retry
interval) that specifies the rule applied to nodes that depend on the
current node.
    The page provides default rules.
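
As a rough illustration of the setting above, here is a minimal Python sketch of a rule object with page-level defaults that a node's own configuration can override. The names `DependencyRule`, `DEFAULT_RULE`, `rule_from_node_config`, and the default values are all hypothetical and not part of DolphinScheduler.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DependencyRule:
    """Hypothetical shape of the "Define Dependency Cycle" setting."""
    cycle: str = "day"                # dependency cycle (assumed default)
    retry_times: int = 3              # number of retries (assumed default)
    retry_interval_minutes: int = 10  # retry interval (assumed default)


# Stand-in for the default rule offered on the configuration page.
DEFAULT_RULE = DependencyRule()


def rule_from_node_config(config: dict) -> DependencyRule:
    """Overlay the values set on the node page onto the default rule.

    Keys that are not fields of DependencyRule are ignored.
    """
    known = {
        key: value
        for key, value in config.items()
        if key in DependencyRule.__dataclass_fields__
    }
    return replace(DEFAULT_RULE, **known)
```

A node that only changes the retry count would call `rule_from_node_config({"retry_times": 5})` and inherit the other defaults.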



--------------------
DolphinScheduler(Incubator) Committer
Hemin Wen  温合民
[email protected]
--------------------


On Wed, Oct 28, 2020 at 11:49 AM, Hemin Wen <[email protected]> wrote:

> Hi!
>
> I would like to propose automatic dependency configuration based on
> table lineage. Everyone is welcome to discuss the ideas below.
>
> ## 1. Background
>
>    Currently, DS can only define workflow/node dependencies by drawing
> the DAG, or by calling the API to create the workflow and its
> dependencies from the workflow's data structure. A data warehouse is
> usually designed in layers, and data is produced along a pipeline, so
> there are complex dependencies between layers and a large number of SQL
> scripts. Creating dependencies by hand makes large sets of workflows
> hard to maintain, and dependency configuration errors are hard to
> troubleshoot.
>
>    Table lineage can be extracted by parsing the SQL statements in
> SQL-related nodes, and dependencies can then be established
> automatically from that lineage. The Master Server executes the
> workflow according to the supplemented dependencies, ensuring that
> nodes run in dependency order.
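
To make the idea concrete, here is a minimal Python sketch of deriving node dependencies from table lineage. It is an illustration under strong assumptions: the regular expressions only recognize simple `INSERT INTO` / `INSERT OVERWRITE TABLE` targets and `FROM` / `JOIN` sources, whereas a real implementation would need a proper SQL parser. The function names and the sample node names are hypothetical.

```python
import re

# Naive patterns (assumption): only simple statements are recognized.
_OUTPUT = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.I
)
_INPUT = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.I)


def extract_lineage(sql):
    """Return (input_tables, output_tables) for one SQL statement."""
    outputs = {t.lower() for t in _OUTPUT.findall(sql)}
    inputs = {t.lower() for t in _INPUT.findall(sql)} - outputs
    return inputs, outputs


def build_dependencies(node_sql):
    """Map each node to the set of nodes whose output tables it reads."""
    lineage = {node: extract_lineage(sql) for node, sql in node_sql.items()}
    deps = {node: set() for node in node_sql}
    for node, (inputs, _) in lineage.items():
        for other, (_, outputs) in lineage.items():
            if other != node and inputs & outputs:
                deps[node].add(other)
    return deps


node_sql = {
    "ods_to_dwd": "INSERT OVERWRITE TABLE dwd.orders SELECT * FROM ods.orders",
    "dwd_to_dws": (
        "insert into dws.order_stats "
        "select o.id from dwd.orders o join dim.users u on o.uid = u.id"
    ),
}
dependencies = build_dependencies(node_sql)
```

Here `dwd_to_dws` reads `dwd.orders`, which `ods_to_dwd` writes, so the second node ends up depending on the first.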
>
> ## 2. Design Ideas
>
>    - Parse the table lineage of the SQL when a workflow is saved, and
> automatically generate the dependency configuration data (SQL-related
> nodes only)
>    - The Master Server automatically resolves dependencies for each node,
> generates dependent nodes, and executes all node tasks
>    - The front-end node configuration page adds an "Automatically
> resolve dependencies" switch that controls whether dependency
> detection is enabled when the node executes
>    - A dependency graph page is added to the front end for viewing node
> dependencies after automatic resolution
>
> Limitations:
>
>    - In the current design, the automatically generated default rule for
> dependent nodes only supports checking whether the node's task for the
> current day succeeded. The check is fixed to run every N minutes, up to
> M times in total; once that count is exceeded, the dependency is
> treated as failed.
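
The fixed default rule described under the limitation (check every N minutes, M times in total, then fail) could look roughly like the Python sketch below. The function name `wait_for_upstream` and its parameters are hypothetical; `is_successful` stands in for whatever lookup the Master Server would use to read the upstream task status for the current day.

```python
import time


def wait_for_upstream(is_successful, interval_minutes, max_checks,
                      sleep=time.sleep):
    """Poll the upstream status up to max_checks times.

    Returns True as soon as a check succeeds, or False once all
    max_checks checks have failed. `sleep` is injectable so the loop
    can be exercised without actually waiting.
    """
    for attempt in range(1, max_checks + 1):
        if is_successful():
            return True
        if attempt < max_checks:
            # Wait N minutes before the next check.
            sleep(interval_minutes * 60)
    return False
```

Passing the interval and count as parameters, rather than fixing them, is one way the rule could later be made configurable per node.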
>
> ## 3. Sequence diagram
>
>     Please refer to the images further below.
>
> ## 4. Table Design
>
> Add a node lineage relationship table: t_ds_node_lineage
>
> | Column Name | Description |
> | --------------------- | ------------------------|
> | id | Auto-incrementing ID |
> | process_definition_id | Workflow definition ID |
> | process_node_id | Workflow node ID |
> | lineage_type | Lineage type (1 input, 2 output) |
> | lineage_union_key | Unique lineage key |
> | create_time | Creation time |
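
As a sketch of how this table might be used, the Python example below creates it in an in-memory SQLite database and looks up a node's upstream nodes by matching its input keys against other nodes' output keys. The column types, the choice of SQLite, and the helper names are assumptions for illustration; the proposal only lists column names and descriptions. It also assumes `process_node_id` identifies a node on its own.

```python
import sqlite3


def create_lineage_table(conn):
    """Create t_ds_node_lineage with assumed column types."""
    conn.execute("""
        CREATE TABLE t_ds_node_lineage (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            process_definition_id INTEGER NOT NULL,
            process_node_id INTEGER NOT NULL,
            lineage_type INTEGER NOT NULL,   -- 1 input, 2 output
            lineage_union_key TEXT NOT NULL,
            create_time TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)


def find_upstream_nodes(conn, node_id):
    """Return (process_definition_id, process_node_id) pairs for nodes
    whose output key matches one of this node's input keys."""
    rows = conn.execute("""
        SELECT DISTINCT up.process_definition_id, up.process_node_id
        FROM t_ds_node_lineage AS down
        JOIN t_ds_node_lineage AS up
          ON up.lineage_union_key = down.lineage_union_key
        WHERE down.process_node_id = ?
          AND down.lineage_type = 1
          AND up.lineage_type = 2
          AND up.process_node_id != down.process_node_id
        ORDER BY 1, 2
    """, (node_id,))
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
create_lineage_table(conn)
conn.executemany(
    "INSERT INTO t_ds_node_lineage "
    "(process_definition_id, process_node_id, lineage_type, lineage_union_key) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 10, 2, "dwd.orders"),       # node 10 writes dwd.orders
        (2, 20, 1, "dwd.orders"),       # node 20 reads dwd.orders
        (2, 20, 2, "dws.order_stats"),  # node 20 writes dws.order_stats
    ],
)
upstream_of_20 = find_upstream_nodes(conn, 20)
```

Because the join is on `lineage_union_key` alone, the lookup works across workflow definitions, which is the cross-workflow case the proposal is aimed at.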
>
>
> Sequence diagrams for section 3:
> [image: image.png]
> [image: image.png]
>
> --------------------
> DolphinScheduler(Incubator) Committer
> Hemin Wen  温合民
> [email protected]
> --------------------
>
