This is an automated email from the ASF dual-hosted git repository.

kerwin pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/dolphinscheduler.git


The following commit(s) were added to refs/heads/dev by this push:
     new f4b7754952 [doc] Update task DataX document (#10218)
f4b7754952 is described below

commit f4b7754952f16f60f1850d672e25e05a2088c49d
Author: Jiajie Zhong <[email protected]>
AuthorDate: Tue May 24 13:45:26 2022 +0800

    [doc] Update task DataX document (#10218)
---
 docs/docs/en/guide/task/datax.md | 126 +++++++++++++++++++--------------------
 docs/docs/zh/guide/task/datax.md | 122 +++++++++++++++++++------------------
 docs/img/datax_edit.png          | Bin 478215 -> 0 bytes
 3 files changed, 126 insertions(+), 122 deletions(-)

diff --git a/docs/docs/en/guide/task/datax.md b/docs/docs/en/guide/task/datax.md
index 20cdec8588..2413d360c9 100644
--- a/docs/docs/en/guide/task/datax.md
+++ b/docs/docs/en/guide/task/datax.md
@@ -1,63 +1,63 @@
-# DataX
-
-## Overview
-
-The DataX task type is used to execute DataX programs. For DataX nodes, the worker executes `${DATAX_HOME}/bin/datax.py` to parse the input JSON file.
-
-## Create Task
-
-- Click `Project Management -> Project Name -> Workflow Definition`, and click the `Create Workflow` button to enter the DAG editing page.
-- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> task node from the toolbar onto the canvas.
-
-## Task Parameter
-
-- **Node name**: The node name in a workflow definition is unique.
-- **Run flag**: Identifies whether this node schedules normally. If it does not need to execute, select `prohibition execution`.
-- **Descriptive information**: Describe the function of the node.
-- **Task priority**: When the number of worker threads is insufficient, 
execute in the order of priority from high to low, and tasks with the same 
priority will execute in a first-in first-out order.
-- **Worker grouping**: Assign tasks to the machines of the worker group to 
execute. If `Default` is selected, randomly select a worker machine for 
execution.
-- **Environment Name**: Configure the environment name in which to run the script.
-- **Times of failed retry attempts**: The number of times the task is resubmitted after it fails.
-- **Failed retry interval**: The time interval (unit: minute) for resubmitting the task after it fails.
-- **Delayed execution time**: The time (unit: minute) by which the task's execution is delayed.
-- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task run time exceeds the "timeout", an alarm email will be sent and the task execution will fail.
-- **Custom template**: Customize the content of the DataX node's JSON profile 
when the default DataSource provided does not meet the requirements.
-- **JSON**: JSON configuration file for DataX synchronization.
-- **Custom parameters**: The custom parameter types and data types are the same as those of the stored procedure task type. The difference is that the custom parameters of the SQL task type replace the `${variable}` in the SQL statement.
-- **Data source**: Select the data source to extract data.
-- **SQL statement**: The SQL statement used to extract data from the target database. The SQL query column names are automatically parsed when the node executes and mapped to the target table's synchronization column names. When the column names of the source table and the target table are inconsistent, they can be converted by column alias (as).
-- **Target library**: Select the target library for data synchronization.
-- **Pre-SQL**: Pre-SQL executes before the SQL statement (executed by the 
target database).
-- **Post-SQL**: Post-SQL executes after the SQL statement (executed by the 
target database).
-- **Stream limit (number of bytes)**: Limit the number of bytes for a query.
-- **Limit flow (number of records)**: Limit the number of records for a query.
-- **Running memory**: Set the minimum and maximum memory required, which can 
be set according to the actual production environment.
-- **Predecessor task**: Selecting a predecessor task for the current task, 
will set the selected predecessor task as upstream of the current task.
-
-## Task Example
-
-This example demonstrates how to import data from Hive into MySQL.
-
-### Configure the DataX environment in DolphinScheduler
-
-If you are using the DataX task type in a production environment, it is 
necessary to configure the required environment first. The following is the 
configuration file: `bin/env/dolphinscheduler_env.sh`.
-
-![datax_task01](/img/tasks/demo/datax_task01.png)
-
-After finishing the environment configuration, restart DolphinScheduler.
-
-### Configure DataX Task Node
-
-As the default DataSource does not support reading data from Hive, a custom JSON file is required; refer to [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note: partition directories exist on the HDFS path; when importing data in real-world situations, it is recommended to pass the partition as a parameter, using custom parameters.
-
-After finishing the required JSON file, you can configure the node by following the steps in the diagram below:
-
-![datax_task02](/img/tasks/demo/datax_task02.png)
-
-### View Execution Result
-
-![datax_task03](/img/tasks/demo/datax_task03.png)
-
-### Notice
-
-If the default DataSource provided does not meet your needs, you can configure the writer and reader of DataX in the custom template options according to the actual usage environment; see [DataX](https://github.com/alibaba/DataX).
+# DataX
+
+## Overview
+
+The DataX task type is used to execute DataX programs. For DataX nodes, the worker executes `${DATAX_HOME}/bin/datax.py` to parse the input JSON file.
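+
+For orientation, the worker's call is roughly equivalent to running DataX by hand, as in the minimal sketch below; the job file path is only a placeholder for the JSON definition the task supplies:
+
+```shell
+# Rough manual equivalent of what the worker runs: DATAX_HOME points at the
+# DataX installation, and /tmp/job.json stands in for the generated JSON job file.
+python ${DATAX_HOME}/bin/datax.py /tmp/job.json
+```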
+
+## Create Task
+
+- Click Project Management -> Project Name -> Workflow Definition, and click 
the "Create Workflow" button to enter the DAG editing page.
+- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> task node from the toolbar onto the canvas.
+
+## Task Parameter
+
+- **Node name**: The node name in a workflow definition is unique.
+- **Run flag**: Identifies whether this node can be scheduled normally. If it does not need to be executed, turn on the prohibition switch.
+- **Descriptive information**: Describe the function of the node.
+- **Task priority**: When the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed on a first-in, first-out basis.
+- **Worker grouping**: Tasks are assigned to the machines of the worker group 
to execute. If Default is selected, a worker machine will be randomly selected 
for execution.
+- **Environment Name**: Configure the environment name in which to run the 
script.
+- **Number of failed retry attempts**: The number of times the task is resubmitted after it fails.
+- **Failed retry interval**: The time interval, in minutes, for resubmitting the task after it fails.
+- **Delayed execution time**: The time, in minutes, by which the task's execution is delayed.
+- **Timeout alarm**: Check the timeout alarm and timeout failure. When the 
task exceeds the "timeout period", an alarm email will be sent and the task 
execution will fail.
+- **Custom template**: Customize the content of the DataX node's JSON configuration file when the default data sources provided do not meet your requirements.
+- **JSON**: The JSON configuration file for DataX synchronization.
+- **Custom parameters**: The custom parameter types and data types are the same as those of the stored procedure task type. The difference is that the custom parameters of the SQL task type replace the `${variable}` in the SQL statement.
+- **Data source**: Select the data source from which the data will be 
extracted.
+- **SQL statement**: The SQL statement used to extract data from the target database. The column names in the SQL query are automatically parsed when the node executes and mapped to the target table's synchronization column names. When the source and target table column names are inconsistent, they can be converted using column aliases (as).
+- **Target library**: Select the target library for data synchronization.
+- **Pre-SQL**: The pre-SQL is executed before the SQL statement (executed by the target library).
+- **Post-SQL**: The post-SQL is executed after the SQL statement (executed by the target library).
+- **Stream limit (number of bytes)**: Limits the number of bytes in the query.
+- **Limit flow (number of records)**: Limit the number of records for a query.
+- **Running memory**: The minimum and maximum memory required can be configured to suit the actual production environment.
+- **Predecessor task**: Selecting a predecessor task for the current task will 
set the selected predecessor task as upstream of the current task.
+
+## Task Example
+
+This example demonstrates importing data from Hive into MySQL.
+
+### Configuring the DataX environment in DolphinScheduler
+
+If you are using the DataX task type in a production environment, it is 
necessary to configure the required environment first. The configuration file 
is as follows: `/dolphinscheduler/conf/env/dolphinscheduler_env.sh`.
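+
+A minimal sketch of the kind of entry this file needs is shown below; the installation path is an assumption and should be adjusted to your environment:
+
+```shell
+# Hypothetical example: point DATAX_HOME at your DataX installation so the
+# worker can resolve ${DATAX_HOME}/bin/datax.py.
+export DATAX_HOME=/opt/soft/datax
+export PATH=$DATAX_HOME/bin:$PATH
+```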
+
+![datax_task01](/img/tasks/demo/datax_task01.png)
+
+After the environment has been configured, DolphinScheduler needs to be 
restarted.
+
+### Configuring DataX Task Node
+
+As the default data sources do not support reading data from Hive, a custom JSON file is required; refer to [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note that partition directories exist on the HDFS path; when importing data in real-world situations, it is recommended to pass the partition as a parameter, using custom parameters.
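+
+For orientation, a custom job file for this Hive-to-MySQL example typically pairs an HDFS reader with a MySQL writer. The skeleton below is only an illustrative sketch: the paths, connection details, column lists and the `${dt}` partition parameter are placeholders, and the authoritative option set is described in the DataX plugin documentation linked above.
+
+```json
+{
+  "job": {
+    "setting": {
+      "speed": { "channel": 1 }
+    },
+    "content": [
+      {
+        "reader": {
+          "name": "hdfsreader",
+          "parameter": {
+            "defaultFS": "hdfs://namenode:8020",
+            "path": "/user/hive/warehouse/demo.db/source_table/dt=${dt}/*",
+            "fileType": "text",
+            "fieldDelimiter": "\t",
+            "column": ["*"]
+          }
+        },
+        "writer": {
+          "name": "mysqlwriter",
+          "parameter": {
+            "username": "user",
+            "password": "password",
+            "column": ["id", "name"],
+            "connection": [
+              {
+                "jdbcUrl": "jdbc:mysql://mysql-host:3306/target_db",
+                "table": ["target_table"]
+              }
+            ]
+          }
+        }
+      }
+    ]
+  }
+}
+```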
+
+After writing the required JSON file, you can configure the node content by following the steps in the diagram below.
+
+![datax_task02](/img/tasks/demo/datax_task02.png)
+
+### View Run Results
+
+![datax_task03](/img/tasks/demo/datax_task03.png)
+
+### Notice
+
+If the default data source provided does not meet your needs, you can 
configure the writer and reader of DataX according to the actual usage 
environment in the custom template option, available at 
https://github.com/alibaba/DataX.
diff --git a/docs/docs/zh/guide/task/datax.md b/docs/docs/zh/guide/task/datax.md
index 5a9e167980..fa1d62a42c 100644
--- a/docs/docs/zh/guide/task/datax.md
+++ b/docs/docs/zh/guide/task/datax.md
@@ -1,59 +1,63 @@
-# DataX Node
-
-## Overview
-
-The DataX task type is used to execute DataX programs. For DataX nodes, the worker parses the input JSON file by executing `${DATAX_HOME}/bin/datax.py`.
-
-## Create Task
-
-- Click Project Management -> Project Name -> Workflow Definition, then click the "Create Workflow" button to enter the DAG editing page.
-- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> task node from the toolbar onto the canvas.
-
-## Task Parameters
-
-- Node name: Set the name of the task node. The node name must be unique within a workflow definition.
-- Run flag: Indicates whether this node can be scheduled normally. If it does not need to be executed, turn on the prohibition switch.
-- Description: Describe the function of the node.
-- Task priority: When the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed first-in, first-out.
-- Worker group: Tasks are assigned to the machines of the worker group for execution. If Default is selected, a worker machine is chosen at random.
-- Environment name: Configure the environment in which the script runs.
-- Number of failed retries: The number of times the task is resubmitted after it fails.
-- Failed retry interval: The time interval, in minutes, for resubmitting the task after it fails.
-- Delayed execution time: The time, in minutes, by which the task's execution is delayed.
-- Timeout alarm: Check timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email is sent and the task execution fails.
-- Custom template: Customize the content of the DataX node's JSON configuration file when the default data sources do not meet your requirements.
-- JSON: The JSON configuration file for DataX synchronization.
-- Custom parameters: The custom parameter types and data types are the same as those of the stored procedure task type. The difference is that the custom parameters of the SQL task type replace the `${variable}` in the SQL statement.
-- Data source: Select the data source from which to extract data.
-- SQL statement: The SQL statement used to extract data for the target library. The column names in the SQL query are automatically parsed when the node executes and mapped to the target table's synchronization column names; when the source and target table column names are inconsistent, they can be converted using column aliases (as).
-- Target library: Select the target library for data synchronization.
-- Target library pre-SQL: The pre-SQL is executed before the SQL statement (executed by the target library).
-- Target library post-SQL: The post-SQL is executed after the SQL statement (executed by the target library).
-- Limit flow (number of bytes): Limit the number of bytes of the query.
-- Limit flow (number of records): Limit the number of records of the query.
-- Running memory: The minimum and maximum memory required can be configured according to the actual production environment.
-- Predecessor task: Selecting a predecessor task for the current task sets the selected predecessor task as upstream of the current task.
-
-## Task Example
-
-This example demonstrates importing data from Hive into MySQL.
-
-### Configure the DataX environment in DolphinScheduler
-
-If you use the DataX task type in a production environment, you need to configure the required environment first. The configuration file is: `bin/env/dolphinscheduler_env.sh`.
-
-![datax_task01](/img/tasks/demo/datax_task01.png)
-
-  <p align="center">
-   <img src="/img/datax_edit.png" width="80%" />
-  </p>
-
-- Custom template: When the custom template switch is turned on, you can customize the content of the DataX node's JSON configuration file (used when the control configuration does not meet your needs).
-- Data source: Select the data source from which to extract data.
-- SQL statement: The SQL statement used to extract data for the target library. The column names in the SQL query are automatically parsed when the node executes and mapped to the target table's synchronization column names; when the source and target table column names are inconsistent, they can be converted using column aliases (as).
-- Target library: Select the target library for data synchronization.
-- Target table: The name of the target table for data synchronization.
-- Pre-SQL: The pre-SQL is executed before the SQL statement (executed by the target library).
-- Post-SQL: The post-SQL is executed after the SQL statement (executed by the target library).
-- JSON: The JSON configuration file for DataX synchronization.
-- Custom parameters: The custom parameter types and data types are the same as those of the stored procedure task type. The difference is that the custom parameters of the SQL task type replace the `${variable}` in the SQL statement.
\ No newline at end of file
+# DataX Node
+
+## Overview
+
+The DataX task type is used to execute DataX programs. For DataX nodes, the worker parses the input JSON file by executing `${DATAX_HOME}/bin/datax.py`.
+
+## Create Task
+
+- Click Project Management -> Project Name -> Workflow Definition, then click the "Create Workflow" button to enter the DAG editing page.
+- Drag the <img src="/img/tasks/icons/datax.png" width="15"/> task node from the toolbar onto the canvas.
+
+## Task Parameters
+
+- Node name: Set the name of the task node. The node name must be unique within a workflow definition.
+- Run flag: Indicates whether this node can be scheduled normally. If it does not need to be executed, turn on the prohibition switch.
+- Description: Describe the function of the node.
+- Task priority: When the number of worker threads is insufficient, tasks are executed in order of priority from high to low; tasks with the same priority are executed first-in, first-out.
+- Worker group: Tasks are assigned to the machines of the worker group for execution. If Default is selected, a worker machine is chosen at random.
+- Environment name: Configure the environment in which the script runs.
+- Number of failed retries: The number of times the task is resubmitted after it fails.
+- Failed retry interval: The time interval, in minutes, for resubmitting the task after it fails.
+- Delayed execution time: The time, in minutes, by which the task's execution is delayed.
+- Timeout alarm: Check timeout alarm and timeout failure. When the task exceeds the "timeout period", an alarm email is sent and the task execution fails.
+- Custom template: Customize the content of the DataX node's JSON configuration file when the default data sources do not meet your requirements.
+- JSON: The JSON configuration file for DataX synchronization.
+- Custom parameters: The custom parameter types and data types are the same as those of the stored procedure task type. The difference is that the custom parameters of the SQL task type replace the `${variable}` in the SQL statement.
+- Data source: Select the data source from which to extract data.
+- SQL statement: The SQL statement used to extract data for the target library. The column names in the SQL query are automatically parsed when the node executes and mapped to the target table's synchronization column names; when the source and target table column names are inconsistent, they can be converted using column aliases (as).
+- Target library: Select the target library for data synchronization.
+- Target library pre-SQL: The pre-SQL is executed before the SQL statement (executed by the target library).
+- Target library post-SQL: The post-SQL is executed after the SQL statement (executed by the target library).
+- Limit flow (number of bytes): Limit the number of bytes of the query.
+- Limit flow (number of records): Limit the number of records of the query.
+- Running memory: The minimum and maximum memory required can be configured according to the actual production environment.
+- Predecessor task: Selecting a predecessor task for the current task sets the selected predecessor task as upstream of the current task.
+
+## Task Example
+
+This example demonstrates importing data from Hive into MySQL.
+
+### Configure the DataX environment in DolphinScheduler
+
+If you use the DataX task type in a production environment, you need to configure the required environment first. The configuration file is: `/dolphinscheduler/conf/env/dolphinscheduler_env.sh`.
+
+![datax_task01](/img/tasks/demo/datax_task01.png)
+
+After the environment has been configured, DolphinScheduler needs to be restarted.
+
+### Configure the DataX Task Node
+
+As the default data sources do not support reading data from Hive, a custom JSON file is required; refer to [HDFS Writer](https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md). Note that partition directories exist on the HDFS path; when importing data in real-world situations, it is recommended to pass the partition as a parameter, using custom parameters.
+
+After writing the required JSON file, you can configure the node content by following the steps in the diagram below.
+
+![datax_task02](/img/tasks/demo/datax_task02.png)
+
+### View Run Results
+
+![datax_task03](/img/tasks/demo/datax_task03.png)
+
+## Notice
+
+If the default data sources do not meet your needs, you can configure the DataX writer and reader in the custom template option according to the actual usage environment; see https://github.com/alibaba/DataX
\ No newline at end of file
diff --git a/docs/img/datax_edit.png b/docs/img/datax_edit.png
deleted file mode 100644
index fbda73419d..0000000000
Binary files a/docs/img/datax_edit.png and /dev/null differ
