zhongjiajie commented on code in PR #9512:
URL: https://github.com/apache/dolphinscheduler/pull/9512#discussion_r854811336
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
Review Comment:
```suggestion
Add config : `<server-name>/conf/common.properties`
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
Review Comment:
```suggestion
```properties
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
```
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
+
+Please fill in `data-quality.jar.name` according to the actual package name,
+If you package `data-quality` separately, remember to modify the package name
to be consistent with `data-quality.jar.name`.
+If the old version is upgraded and used, you need to execute the `sql` update
script to initialize the database before running.
+If you want to use `MySQL` data, you need to comment out the `scope` of
`MySQL` in `pom.xml`
+Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested,
other data sources have not been tested yet
+`Spark` needs to be configured to read `Hive` metadata, `Spark` does not use
`jdbc` to read `Hive`
+
+## 1.3 Detail
+
+- CheckMethod: [CheckFormula][Operator][Threshold], if the result is true, it
indicates that the data does not meet expectations, and the failure strategy is
executed.
+- CheckFormula:
+ - Expected-Actual
+ - Actual-Expected
+ - (Actual/Expected)x100%
+ - (Expected-Actual)/Expected x100%
+- Operator:=、>、>=、<、<=、!=
+- ExpectedValue
+ - FixValue
+ - DailyAvg
+ - WeeklyAvg
+ - MonthlyAvg
+ - Last7DayAvg
+ - Last30DayAvg
+ - SrcTableTotalRows
+ - TargetTableTotalRows
+
+- eg
+ - CheckFormula:Expected-Actual
+ - Operator:>
+ - Threshold:0
+ - ExpectedValue:FixValue=9。
+
+Assuming that the actual value is 10, the operator is >, and the expected
value is 9, then the result 10 -9 > 0 is true, which means that the row data in
the empty column has exceeded the threshold, and the task is judged to fail
+# 2 Guide
+## 2.1 NullCheck
+### 2.1.1 Introduction
+The goal of the null value check is to check the number of empty rows in the
specified column. The number of empty rows can be compared with the total
number of rows or a specified threshold. If it is greater than a certain
threshold, it will be judged as failure.
+- Calculate the SQL statement that the specified column is empty as follows:
+ - SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
+- The SQL to calculate the total number of rows in the table is as follows:
+ - SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
Review Comment:
```suggestion
```sql
SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
```
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
Review Comment:
```suggestion
`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
+
+Please fill in `data-quality.jar.name` according to the actual package name,
+If you package `data-quality` separately, remember to modify the package name
to be consistent with `data-quality.jar.name`.
+If the old version is upgraded and used, you need to execute the `sql` update
script to initialize the database before running.
+If you want to use `MySQL` data, you need to comment out the `scope` of
`MySQL` in `pom.xml`
+Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested,
other data sources have not been tested yet
+`Spark` needs to be configured to read `Hive` metadata, `Spark` does not use
`jdbc` to read `Hive`
+
+## 1.3 Detail
+
+- CheckMethod: [CheckFormula][Operator][Threshold], if the result is true, it
indicates that the data does not meet expectations, and the failure strategy is
executed.
+- CheckFormula:
+ - Expected-Actual
+ - Actual-Expected
+ - (Actual/Expected)x100%
+ - (Expected-Actual)/Expected x100%
+- Operator:=、>、>=、<、<=、!=
+- ExpectedValue
+ - FixValue
+ - DailyAvg
+ - WeeklyAvg
+ - MonthlyAvg
+ - Last7DayAvg
+ - Last30DayAvg
+ - SrcTableTotalRows
+ - TargetTableTotalRows
+
+- eg
+ - CheckFormula:Expected-Actual
+ - Operator:>
+ - Threshold:0
+ - ExpectedValue:FixValue=9。
+
+Assuming that the actual value is 10, the operator is >, and the expected
value is 9, then the result 10 -9 > 0 is true, which means that the row data in
the empty column has exceeded the threshold, and the task is judged to fail
+# 2 Guide
+## 2.1 NullCheck
+### 2.1.1 Introduction
+The goal of the null value check is to check the number of empty rows in the
specified column. The number of empty rows can be compared with the total
number of rows or a specified threshold. If it is greater than a certain
threshold, it will be judged as failure.
+- Calculate the SQL statement that the specified column is empty as follows:
+ - SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
Review Comment:
```suggestion
```sql
SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
```
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
+
+Please fill in `data-quality.jar.name` according to the actual package name,
+If you package `data-quality` separately, remember to modify the package name
to be consistent with `data-quality.jar.name`.
+If the old version is upgraded and used, you need to execute the `sql` update
script to initialize the database before running.
+If you want to use `MySQL` data, you need to comment out the `scope` of
`MySQL` in `pom.xml`
+Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested,
other data sources have not been tested yet
+`Spark` needs to be configured to read `Hive` metadata, `Spark` does not use
`jdbc` to read `Hive`
+
+## 1.3 Detail
+
+- CheckMethod: [CheckFormula][Operator][Threshold], if the result is true, it
indicates that the data does not meet expectations, and the failure strategy is
executed.
+- CheckFormula:
+ - Expected-Actual
+ - Actual-Expected
+ - (Actual/Expected)x100%
+ - (Expected-Actual)/Expected x100%
+- Operator:=、>、>=、<、<=、!=
+- ExpectedValue
+ - FixValue
+ - DailyAvg
+ - WeeklyAvg
+ - MonthlyAvg
+ - Last7DayAvg
+ - Last30DayAvg
+ - SrcTableTotalRows
+ - TargetTableTotalRows
+
+- eg
+ - CheckFormula:Expected-Actual
+ - Operator:>
+ - Threshold:0
+ - ExpectedValue:FixValue=9。
+
+Assuming that the actual value is 10, the operator is >, and the expected
value is 9, then the result 10 -9 > 0 is true, which means that the row data in
the empty column has exceeded the threshold, and the task is judged to fail
+# 2 Guide
+## 2.1 NullCheck
+### 2.1.1 Introduction
+The goal of the null value check is to check the number of empty rows in the
specified column. The number of empty rows can be compared with the total
number of rows or a specified threshold. If it is greater than a certain
threshold, it will be judged as failure.
+- Calculate the SQL statement that the specified column is empty as follows:
+ - SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
Review Comment:
Should use SQL syntax highlighting here (a ```sql fenced block).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]