Thank you all! Based on this discussion, I have sorted out a task list for the JSON split. I very much hope that anyone interested in working on the implementation will join us. The issue: https://github.com/apache/incubator-dolphinscheduler/issues/4325
Hemin Wen <[email protected]> 于2020年12月4日周五 下午2:24写道: > According to the results of the discussion, the plan has been re-optimized > and related development work: > > ## 1. Currently > The workflow definition of the current DS system includes task definition > data and task relationship data. In the design of the database, task data > and task relationship data are stored in the workflow as a string type > field (process_definition_json) Definition table (t_ds_process_definition). > > With the increase of workflow and tasks, the following problems will arise: > > -Task data, relational data and workflow data are coupled together, which > is not friendly to the scenario of single-task scheduling. The task must be > created in the workflow > > -The task cannot be reused because the task is created in the workflow > > -The maintenance cost is high. If you move the whole body and modify any > task, you need to update the data in the workflow as a whole, and it also > increases the log cost > > -When there are many tasks in the workflow, the efficiency of global search > and statistical analysis is low, such as querying which tasks use which > data source > > -Poor scalability, for example, the realization of blood relationship > function in the future will only lead to more and more bloated workflow > definitions > > -Tasks, relationships, and workflow boundaries are blurred. Condition nodes > and delay nodes are also regarded as a task, which is actually a > combination of relationships and conditions > > Based on the above pain points, we need to redefine the business boundaries > of tasks, relationships, and workflows, and redesign their data structures > based on this > > ## 2. Design Ideas > > ### 2.1 Workflow, relationship, job > > First of all, we set aside the current implementation and clarify the > business boundaries of tasks (the subsequent description is changed to > jobs), relationships, and workflows, and how to decouple > > -Job: the task to be executed by the scheduling system, the job only > contains the data and resources needed to execute the job > -Relationship: the relationship between the job and the job and the > execution conditions, including the execution relationship (after A > completes, execute B) and execution conditions (after A completes and > succeeds, execute B; after A completes and fails, execute C; A completes 30 > After minutes, execute D) > -Workflow: the carrier of a set of relationships, the workflow only saves > the relationships between jobs (DAG is a form of presentation of workflow, > a way to create relationships) > > Combined with the functions supported by the current DS, we can make a > classification > > -Job: Dependency check, sub-process, Shell, stored procedure, Sql, Spark, > Flink, MR, Python, Http, DataX, Sqoop > -Relationship: serial execution, parallel execution, aggregate execution, > conditional branch, delayed execution > -Workflow: the boundary of scheduling execution, including a set of > relationships > > #### 2.1.1 Further refinement > > The job definition data is not much different from the current job > definition data. Both are composed of public fields and custom fields. You > only need to remove the fields related to the relationship. > > The workflow definition data is not much different from the current > workflow definition data, just remove the json field. > > Relational data, we can abstract into two nodes and one path according to > classification. The node is the job. 
> ### 2.2 Version Management
>
> Once the business boundaries are clarified and the entities are
> decoupled, what remains between them are reference relationships: a
> workflow to its relationships is one-to-many, and a relationship to jobs
> is one-to-many. The definition data also needs to keep version records,
> which can support restoring historical data in the future.
>
> So the design idea here is:
>
> - Definition data needs an added version field
> - Each definition table needs a corresponding log table
> - When creating definition data, double-write to the definition table
>   and the log table; when modifying definition data, save the modified
>   version to the log table
> - Reference data in the definition tables does not need to store version
>   information (it references the latest version)
>
> A sketch of the double-write flow follows.
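> A minimal sketch of the create/modify double-write described above (all
> class, method, and table names are assumptions for illustration, and
> transaction handling is elided):
>
> ```java
> // Illustrative sketch of section 2.2; names are hypothetical.
> public class DefinitionVersioningSketch {
>
>     static class TaskDefinition {
>         long code;
>         int version;
>         String taskParams;
>     }
>
>     interface TaskDefinitionDao {
>         // The definition table keeps only the latest version.
>         void upsertLatest(TaskDefinition def);
>         // The log table keeps every version ever written.
>         void appendLog(TaskDefinition def, String opType, long operatorId);
>     }
>
>     private final TaskDefinitionDao dao;
>
>     DefinitionVersioningSketch(TaskDefinitionDao dao) { this.dao = dao; }
>
>     // Create: version starts at 1 and is double-written to both tables.
>     void create(TaskDefinition def, long operatorId) {
>         def.version = 1;
>         dao.upsertLatest(def);
>         dao.appendLog(def, "add", operatorId);
>     }
>
>     // Modify: bump the version, overwrite the latest, append to the log.
>     void modify(TaskDefinition def, long operatorId) {
>         def.version += 1;
>         dao.upsertLatest(def);
>         dao.appendLog(def, "modify", operatorId);
>     }
>
>     public static void main(String[] args) {
>         TaskDefinitionDao inMemory = new TaskDefinitionDao() {
>             public void upsertLatest(TaskDefinition d) {
>                 System.out.println("definition <- code=" + d.code
>                         + " v" + d.version);
>             }
>             public void appendLog(TaskDefinition d, String op, long uid) {
>                 System.out.println("log <- code=" + d.code
>                         + " v" + d.version + " op=" + op);
>             }
>         };
>         DefinitionVersioningSketch svc =
>                 new DefinitionVersioningSketch(inMemory);
>         TaskDefinition def = new TaskDefinition();
>         def.code = 1001L;
>         svc.create(def, 42L);  // v1 goes to both tables
>         svc.modify(def, 42L);  // v2 updates the latest and appends a log
>     }
> }
> ```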
> ### 2.3 Instance data
>
> The current DB design already has a workflow instance table and a task
> instance table, and DS currently supports data changes in the instance
> tables. The instance tables cannot store only the code and version
> information of the definition tables; they also need to maintain the
> detailed definition data. Therefore the workflow instance table needs to
> be split into a workflow instance table and a job relation table, while
> the task instance table stays largely unchanged. The fields of the three
> instance tables are basically the same as those of the definition
> tables.
>
> ### 2.4 Business Identifier Design
>
> This also involves the import and export of workflow and job definition
> data. According to the previous community discussion, a business
> identifier needs to be introduced. Every row in the workflow definition
> table and the job definition table will carry a business identifier, and
> relation definition data, dependent jobs, and sub-workflow jobs
> establish reference relationships through this identifier. The concrete
> implementation of the business identifier awaits the result of the vote
> on the design plan.
>
> Related issue:
> https://github.com/apache/incubator-dolphinscheduler/issues/3820
>
> ## 3. Design plan
>
> ### 3.1 Table model design
>
> #### 3.1.1 Workflow definition table: t_ds_process_definition
>
> | Column Name | Description |
> | ---- | ---- |
> | id | Auto-increment ID |
> | code | Code (replaces the original name field) |
> | version | Version |
> | description | Description |
> | project_code | Project code |
> | release_state | Release state |
> | user_id | Owning user ID |
> | global_params | Global parameters |
> | flag | Whether the process is available: 0 unavailable, 1 available |
> | receivers | Recipients |
> | receivers_cc | CC recipients |
> | timeout | Timeout |
> | tenant_id | Tenant ID |
> | locations | Node coordinate information |
> | create_time | Creation time |
> | update_time | Modification time |
>
> #### 3.1.2 Workflow job relation table: t_ds_process_task_relation
>
> Note: the last node has no condition data and no post-node data. Picture
> a line with two ends: the left end is the pre-node, the middle is the
> condition, and the right end is the post-node.
>
> | Column Name | Description |
> | ----------------------- | -------------------------------------- |
> | id | Auto-increment ID |
> | project_code | Project code |
> | process_definition_code | Workflow code |
> | pre_project_code | Pre-referenced project code |
> | pre_task_code | Pre-referenced job code |
> | condition_type | Condition type 0: none 1: judgment 2: delay |
> | condition_params | Condition parameters (json) |
> | post_project_code | Post-referenced project code |
> | post_task_code | Post-referenced job code |
> | create_time | Creation time |
> | update_time | Modification time |
>
> #### 3.1.3 Job definition table: t_ds_task_definition
>
> | Column Name | Description |
> | ----------------------- | -------------- |
> | id | Auto-increment ID |
> | code | Code (replaces the original name field) |
> | version | Version |
> | description | Description |
> | project_code | Project code |
> | task_type | Job type |
> | task_params | Job custom parameters |
> | run_flag | Run flag |
> | task_priority | Job priority |
> | worker_group | Worker group |
> | fail_retry_times | Number of failure retries |
> | fail_retry_interval | Failure retry interval |
> | timeout_flag | Timeout flag |
> | timeout_notify_strategy | Timeout notification strategy |
> | timeout_duration | Timeout duration |
> | create_time | Creation time |
> | update_time | Modification time |
>
> #### 3.1.4 Workflow definition log table: t_ds_process_definition_log
>
> Based on the workflow definition table, add operation type (add, modify,
> delete), operator, and operation time.
>
> #### 3.1.5 Workflow job relation log table:
> t_ds_process_task_relation_log
>
> Based on the job relation table, add the workflow version, operation
> type (add, modify, delete), operator, and operation time.
>
> #### 3.1.6 Job definition log table: t_ds_task_definition_log
>
> Based on the job definition table, add operation type (add, modify,
> delete), operator, and operation time.
>
> ### 3.2 Master-Worker scheduling design
>
> When the Master schedules a workflow, it queries the workflow details
> and all job relation data by project code and workflow code (job data is
> not loaded at this point), generates the DAG, traverses the jobs in the
> DAG, and sends the project code and job code to a Worker. The Worker
> queries the detailed job data by project code and job code and executes
> the job. A minimal sketch of this flow follows.
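> A minimal sketch of the Master-side flow just described (the class and
> method names are illustrative assumptions, not actual DS code; for
> brevity it dispatches in BFS order and ignores condition rules and
> waiting for upstream completion):
>
> ```java
> import java.util.*;
>
> // Illustrative sketch of section 3.2; names are hypothetical.
> public class MasterScheduleSketch {
>
>     // One row of t_ds_process_task_relation, reduced to the two codes.
>     record RelationRow(long preTaskCode, long postTaskCode) {}
>
>     // Build the DAG as an adjacency map from relation rows alone --
>     // no job definition data is loaded here.
>     static Map<Long, List<Long>> buildDag(List<RelationRow> rows) {
>         Map<Long, List<Long>> dag = new HashMap<>();
>         for (RelationRow r : rows) {
>             dag.computeIfAbsent(r.preTaskCode(), k -> new ArrayList<>())
>                .add(r.postTaskCode());
>         }
>         return dag;
>     }
>
>     // Traverse from the root jobs, sending only (projectCode, taskCode)
>     // to a Worker; the Worker looks up the detailed job data itself.
>     static void schedule(long projectCode, List<RelationRow> rows,
>                          List<Long> roots) {
>         Map<Long, List<Long>> dag = buildDag(rows);
>         Deque<Long> queue = new ArrayDeque<>(roots);
>         Set<Long> visited = new HashSet<>(roots);
>         while (!queue.isEmpty()) {
>             long taskCode = queue.poll();
>             dispatchToWorker(projectCode, taskCode);
>             for (long next : dag.getOrDefault(taskCode, List.of())) {
>                 if (visited.add(next)) {
>                     queue.add(next);
>                 }
>             }
>         }
>     }
>
>     static void dispatchToWorker(long projectCode, long taskCode) {
>         // Placeholder for the RPC to a Worker, which then queries the
>         // job definition by (projectCode, taskCode) and executes it.
>         System.out.printf("dispatch project=%d task=%d%n",
>                           projectCode, taskCode);
>     }
>
>     public static void main(String[] args) {
>         // 1 -> 2, 1 -> 3, 3 -> 4
>         List<RelationRow> rows = List.of(new RelationRow(1, 2),
>                 new RelationRow(1, 3), new RelationRow(3, 4));
>         schedule(100, rows, List.of(1L));
>     }
> }
> ```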
> ## 4. Related work split
>
> ### 4.1 Frontend
>
> Add job management functions, including: job list; job creation, update,
> deletion, and view-details operations.
>
> On the workflow creation page, pass the workflow information, job
> relationship information, and job information to the backend API layer
> to save/update.
>
> On the workflow page, when dragging task nodes, also support referencing
> a project's job (defaulting the search to jobs under the current
> project) and creating a job.
>
> ### 4.2 API layer
>
> Add processing interfaces for job data, including version handling
> (query, create, modify, delete, online/offline, ...).
>
> Refactor the processing interfaces for workflow data, including version
> handling (query, create, modify, delete, import, export, online/offline,
> ...).
>
> Refactor the processing interfaces for workflow instance data (query,
> modify, Gantt chart).
>
> Refactor the job instance query interface.
>
> Refactor the statistics interfaces for workflow instances and job
> instances (UI system home page, project home page statistics, related
> monitoring data).
>
> ### 4.3 Master
>
> Rebuild the Master according to the <3.2 Master-Worker scheduling
> design> scheme.
>
> ### 4.4 Worker
>
> Refactor the Worker according to the <3.2 Master-Worker scheduling
> design> scheme.
>
> --------------------
> DolphinScheduler(Incubator) Committer
> Hemin Wen 温合民
> [email protected]
> --------------------
>
>
> Hemin Wen <[email protected]> wrote on Wednesday, November 25, 2020
> at 10:01 AM:
>
> > Hi!
> >
> > About the json splitting of the workflow definition: the following is
> > the design plan for splitting it into three tables.
> >
> > Everyone is welcome to discuss it together.
> >
> > ## 1. Current situation
> >
> > The workflow definition in the current DS system contains both task
> > definition data and task relationship data. In the database design,
> > the task data and task relationship data are stored as a single
> > string-type field (process_definition_json) in the workflow definition
> > table (t_ds_process_definition).
> >
> > As workflows and tasks grow, the following problems arise:
> >
> > - Task data, relationship data, and workflow data are coupled
> >   together, which is unfriendly to single-task scheduling scenarios: a
> >   task must be created inside a workflow
> >
> > - Tasks cannot be reused, because each task is created inside a
> >   workflow
> >
> > - Maintenance cost is high: modifying any single task requires
> >   updating the workflow data as a whole, which also increases the
> >   logging cost
> >
> > - When a workflow contains many tasks, global search and statistical
> >   analysis are inefficient, such as querying which tasks use a given
> >   data source
> >
> > - Extensibility is poor; for example, implementing a lineage feature
> >   in the future would only make workflow definitions more and more
> >   bloated
> >
> > - The boundaries between tasks, relationships, and workflows are
> >   blurred: condition nodes and delay nodes are treated as tasks, when
> >   they are actually combinations of relationships and conditions
> >
> > Based on the above pain points, we need to redefine the business
> > boundaries of tasks, relationships, and workflows, and redesign their
> > data structures accordingly.
> > ## 2. Design Ideas
> >
> > ### 2.1 Workflow, relation, job
> >
> > First, setting aside the current implementation, we clarify the
> > business boundaries of tasks (referred to as "jobs" from here on),
> > relations, and workflows, and how to decouple them:
> >
> > - Job: the task the scheduling system really needs to execute; a job
> >   contains only the data and resources needed to execute it
> > - Relation: the relationship between jobs and the execution
> >   conditions, including execution relationships (after A completes,
> >   execute B) and execution conditions (after A completes successfully,
> >   execute B; after A completes with failure, execute C; 30 minutes
> >   after A completes, execute D)
> > - Workflow: the carrier of a set of relations; a workflow only stores
> >   the relationships between jobs (a DAG is one display form of a
> >   workflow, and one way to create relations)
> >
> > Combined with the functions the current DS supports, we can make a
> > classification:
> >
> > - Jobs: dependency check, sub-process, Shell, stored procedure, Sql,
> >   Spark, Flink, MR, Python, Http, DataX, Sqoop
> > - Relations: serial execution, parallel execution, aggregate
> >   execution, conditional branch, delayed execution
> > - Workflow: the boundary of scheduled execution, containing a set of
> >   relations
> >
> > #### 2.1.1 Further refinement
> >
> > The job definition data is not much different from the current task
> > definition data: both consist of common fields and custom fields. We
> > only need to remove the relationship-related fields.
> >
> > The workflow definition data is also not much different from the
> > current workflow definition data: we just remove the json field.
> >
> > Relation data can be abstracted, according to the classification
> > above, into two nodes and one path. The nodes are jobs; the path
> > carries the condition rules that must be satisfied to go from the
> > pre-node to the post-node. The condition rules include: unconditional,
> > judgment condition, and delay condition.
> >
> > ### 2.2 Version Management
> >
> > Once the business boundaries are clarified and the entities are
> > decoupled, what remains between them are reference relationships: a
> > workflow to its relations is one-to-many, and a relation to jobs is
> > one-to-many. It is not only the definition data; we must also consider
> > instance data. Every scheduled execution of a workflow generates a
> > workflow instance. Jobs and workflows can change, yet a workflow
> > instance must still support viewing, rerunning, failure recovery, and
> > so on. This requires introducing version management for the definition
> > data: every change to a workflow, relation, or job must save the old
> > version's data and generate a new version.
> >
> > So the design idea here is:
> >
> > - Definition data needs an added version field
> > - Each definition table needs a corresponding log table
> > - When creating definition data, double-write to the definition table
> >   and the log table; when modifying definition data, save the modified
> >   version to the log table
> > - Reference data in the definition tables does not need to store
> >   version information (it references the latest version); the version
> >   in effect at execution time is saved in the instance data, as in the
> >   sketch below
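> > A minimal sketch of pinning the executed version in instance data, as
> > described above (class and field names are assumptions for
> > illustration):
> >
> > ```java
> > // Illustrative sketch of section 2.2; names are hypothetical.
> > public class InstanceVersionSketch {
> >
> >     // Definition row; only the latest version lives in the
> >     // definition table, every version lives in the log table.
> >     record TaskDefinition(long code, int version, String taskParams) {}
> >
> >     // Instance row: pins the (code, version) current at scheduling
> >     // time, so reruns and failure recovery see the same definition
> >     // even if the job has since been modified.
> >     record TaskInstance(long instanceId, long taskCode, int taskVersion) {}
> >
> >     interface TaskDefinitionLogDao {
> >         TaskDefinition findByCodeAndVersion(long code, int version);
> >     }
> >
> >     static TaskDefinition resolveForRerun(TaskInstance inst,
> >                                           TaskDefinitionLogDao logDao) {
> >         return logDao.findByCodeAndVersion(inst.taskCode(),
> >                                            inst.taskVersion());
> >     }
> >
> >     public static void main(String[] args) {
> >         TaskDefinitionLogDao logDao = (code, version) ->
> >                 new TaskDefinition(code, version, "{\"v\":" + version + "}");
> >         TaskInstance inst = new TaskInstance(1L, 1001L, 2); // pinned to v2
> >         System.out.println(resolveForRerun(inst, logDao));
> >     }
> > }
> > ```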
> > ### 2.3 Coding Design
> >
> > This also involves the import and export of workflow and job
> > definition data. According to the previous community discussion, a
> > coding scheme needs to be introduced: each piece of workflow, relation,
> > and job data will have a unique code. Related issue:
> > https://github.com/apache/incubator-dolphinscheduler/issues/3820
> >
> > Resource: RESOURCE_xxx
> >
> > Job: TASK_xxx
> >
> > Relation: RELATION_xxx
> >
> > Workflow: PROCESS_xxx
> >
> > Project: PROJECT_xxx
> >
> > A sketch of generating such codes is shown below.
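> > The mail leaves the concrete code generation to the design vote; as a
> > minimal sketch (the UUID-based suffix below is purely an illustrative
> > assumption):
> >
> > ```java
> > // Illustrative sketch of the prefixed unique codes in section 2.3.
> > import java.util.UUID;
> >
> > public final class UnionCodeSketch {
> >
> >     enum EntityType { RESOURCE, TASK, RELATION, PROCESS, PROJECT }
> >
> >     // e.g. "TASK_550e8400e29b41d4a716446655440000"
> >     static String newUnionCode(EntityType type) {
> >         String suffix = UUID.randomUUID().toString().replace("-", "");
> >         return type.name() + "_" + suffix;
> >     }
> >
> >     public static void main(String[] args) {
> >         System.out.println(newUnionCode(EntityType.TASK));
> >         System.out.println(newUnionCode(EntityType.PROCESS));
> >     }
> > }
> > ```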
> > ## 3. Design plan
> >
> > ### 3.1 Table model design
> >
> > #### 3.1.1 Job definition table: t_ds_task_definition
> >
> > | Column Name | Description |
> > | ----------------------- | -------------- |
> > | id | Auto-increment ID |
> > | union_code | Unique code |
> > | version | Version |
> > | name | Job name |
> > | description | Description |
> > | task_type | Job type |
> > | task_params | Job custom parameters |
> > | run_flag | Run flag |
> > | task_priority | Job priority |
> > | worker_group | Worker group |
> > | fail_retry_times | Number of failure retries |
> > | fail_retry_interval | Failure retry interval |
> > | timeout_flag | Timeout flag |
> > | timeout_notify_strategy | Timeout notification strategy |
> > | timeout_duration | Timeout duration |
> > | create_time | Creation time |
> > | update_time | Modification time |
> >
> > #### 3.1.2 Task relation table: t_ds_task_relation
> >
> > | Column Name | Description |
> > | ----------------------- | ---------------------------------------- |
> > | id | Auto-increment ID |
> > | union_code | Unique code |
> > | version | Version |
> > | process_definition_code | Workflow code |
> > | node_code | Node code (workflow code/job code) |
> > | post_node_code | Post-node code (workflow code/job code) |
> > | condition_type | Condition type 0: none 1: judgment condition 2: delay condition |
> > | condition_params | Condition parameters |
> > | create_time | Creation time |
> > | update_time | Modification time |
> >
> > #### 3.1.3 Workflow definition table: t_ds_process_definition
> >
> > | Column Name | Description |
> > | ---- | ---- |
> > | id | Auto-increment ID |
> > | union_code | Unique code |
> > | version | Version |
> > | name | Workflow name |
> > | project_code | Project code |
> > | release_state | Release state |
> > | user_id | Owning user ID |
> > | description | Description |
> > | global_params | Global parameters |
> > | flag | Whether the process is available: 0 unavailable, 1 available |
> > | receivers | Recipients |
> > | receivers_cc | CC recipients |
> > | timeout | Timeout |
> > | tenant_id | Tenant ID |
> > | create_time | Creation time |
> > | update_time | Modification time |
> >
> > #### 3.1.4 Job definition log table: t_ds_task_definition_log
> >
> > Based on the job definition table, add operation type (add, modify,
> > delete), operator, and operation time.
> >
> > #### 3.1.5 Job relation log table: t_ds_task_relation_log
> >
> > Based on the job relation table, add operation type (add, modify,
> > delete), operator, and operation time.
> >
> > #### 3.1.6 Workflow definition log table: t_ds_process_definition_log
> >
> > Based on the workflow definition table, add operation type (add,
> > modify, delete), operator, and operation time.
> >
> > ### 3.2 Frontend
> >
> > *The design here is just my personal idea; frontend help is needed to
> > design the interaction.*
> >
> > Job management functions need to be added, including: job list; job
> > creation, update, deletion, and view-details operations.
> >
> > On the workflow creation page, the json needs to be split into
> > workflow definition data and job relation data, which are passed to
> > the backend API layer to save/update.
> >
> > On the workflow page, when dragging task nodes, add a reference-job
> > option.
> >
> > Conditional branch nodes and delay nodes need to be resolved into the
> > condition rule data of the relation; conversely, when querying a
> > workflow, the condition rule data returned by the backend needs to be
> > displayed as the corresponding nodes.
> >
> > ### 3.3 Master
> >
> > When the Master schedules a workflow, <build dag from json> needs to
> > change to <build dag from relational data>. When executing a workflow,
> > first load the relation data in full (no job data is loaded here),
> > generate the DAG, and while traversing the DAG for execution, fetch
> > the job data that needs to be executed.
> >
> > Other execution flows remain consistent with the existing flows.
> >
> > --------------------
> > DolphinScheduler(Incubator) Committer
> > Hemin Wen 温合民
> > [email protected]
> > --------------------

--
DolphinScheduler(Incubator) PPMC
BaoLiang 鲍亮
[email protected]
