zhongjiajie commented on code in PR #9512:
URL: https://github.com/apache/dolphinscheduler/pull/9512#discussion_r854811336
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
Review Comment:
```suggestion
Add config : `<server-name>/conf/common.properties`
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
Review Comment:
```suggestion
```properties
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
```
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
+
+Please fill in `data-quality.jar.name` according to the actual package name,
+If you package `data-quality` separately, remember to modify the package name
to be consistent with `data-quality.jar.name`.
+If the old version is upgraded and used, you need to execute the `sql` update
script to initialize the database before running.
+If you want to use `MySQL` data, you need to comment out the `scope` of
`MySQL` in `pom.xml`
+Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested,
other data sources have not been tested yet
+`Spark` needs to be configured to read `Hive` metadata, `Spark` does not use
`jdbc` to read `Hive`
+
+## 1.3 Detail
+
+- CheckMethod: [CheckFormula][Operator][Threshold], if the result is true, it
indicates that the data does not meet expectations, and the failure strategy is
executed.
+- CheckFormula:
+ - Expected-Actual
+ - Actual-Expected
+ - (Actual/Expected)x100%
+ - (Expected-Actual)/Expected x100%
+- Operator:=、>、>=、<、<=、!=
+- ExpectedValue
+ - FixValue
+ - DailyAvg
+ - WeeklyAvg
+ - MonthlyAvg
+ - Last7DayAvg
+ - Last30DayAvg
+ - SrcTableTotalRows
+ - TargetTableTotalRows
+
+- eg
+ - CheckFormula:Expected-Actual
+ - Operator:>
+ - Threshold:0
+ - ExpectedValue:FixValue=9。
+
+Assuming that the actual value is 10, the operator is >, and the expected
value is 9, then the result 10 -9 > 0 is true, which means that the row data in
the empty column has exceeded the threshold, and the task is judged to fail
+# 2 Guide
+## 2.1 NullCheck
+### 2.1.1 Introduction
+The goal of the null value check is to check the number of empty rows in the
specified column. The number of empty rows can be compared with the total
number of rows or a specified threshold. If it is greater than a certain
threshold, it will be judged as failure.
+- Calculate the SQL statement that the specified column is empty as follows:
+ - SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
+- The SQL to calculate the total number of rows in the table is as follows:
+ - SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
Review Comment:
```suggestion
```sql
SELECT COUNT(*) AS total FROM ${src_table} WHERE (${src_filter})
```
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
Review Comment:
```suggestion
`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
+
+Please fill in `data-quality.jar.name` according to the actual package name,
+If you package `data-quality` separately, remember to modify the package name
to be consistent with `data-quality.jar.name`.
+If the old version is upgraded and used, you need to execute the `sql` update
script to initialize the database before running.
+If you want to use `MySQL` data, you need to comment out the `scope` of
`MySQL` in `pom.xml`
+Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested,
other data sources have not been tested yet
+`Spark` needs to be configured to read `Hive` metadata, `Spark` does not use
`jdbc` to read `Hive`
+
+## 1.3 Detail
+
+- CheckMethod: [CheckFormula][Operator][Threshold], if the result is true, it
indicates that the data does not meet expectations, and the failure strategy is
executed.
+- CheckFormula:
+ - Expected-Actual
+ - Actual-Expected
+ - (Actual/Expected)x100%
+ - (Expected-Actual)/Expected x100%
+- Operator:=、>、>=、<、<=、!=
+- ExpectedValue
+ - FixValue
+ - DailyAvg
+ - WeeklyAvg
+ - MonthlyAvg
+ - Last7DayAvg
+ - Last30DayAvg
+ - SrcTableTotalRows
+ - TargetTableTotalRows
+
+- eg
+ - CheckFormula:Expected-Actual
+ - Operator:>
+ - Threshold:0
+ - ExpectedValue:FixValue=9。
+
+Assuming that the actual value is 10, the operator is >, and the expected
value is 9, then the result 10 -9 > 0 is true, which means that the row data in
the empty column has exceeded the threshold, and the task is judged to fail
+# 2 Guide
+## 2.1 NullCheck
+### 2.1.1 Introduction
+The goal of the null value check is to check the number of empty rows in the
specified column. The number of empty rows can be compared with the total
number of rows or a specified threshold. If it is greater than a certain
threshold, it will be judged as failure.
+- Calculate the SQL statement that the specified column is empty as follows:
+ - SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
Review Comment:
```suggestion
```sql
SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
```
```
##########
docs/docs/en/guide/task/data-quality.md:
##########
@@ -0,0 +1,299 @@
+# 1 Overview
+## 1.1 Introduction
+
+The data quality task is used to check the data accuracy during the
integration and processing of data. Data quality tasks in this release include
single-table checking, single-table custom SQL checking, multi-table accuracy,
and two-table value comparisons. The running environment of the data quality
task is Spark 2.4.0, and other versions have not been verified, and users can
verify by themselves.
+- The execution flow of the data quality task is as follows:
+
+> The user defines the task in the interface, and the user input value is
stored in `TaskParam`
+When running a task, `Master` will parse `TaskParam`, encapsulate the
parameters required by `DataQualityTask` and send it to `Worker`.
+Worker runs the data quality task. After the data quality task finishes
running, it writes the statistical results to the specified storage engine. The
current data quality task result is stored in the `t_ds_dq_execute_result`
table of `dolphinscheduler`
+`Worker` sends the task result to `Master`, after `Master` receives
`TaskResponse`, it will judge whether the task type is `DataQualityTask`, if
so, it will read the corresponding result from `t_ds_dq_execute_result`
according to `taskInstanceId`, and then The result is judged according to the
check mode, operator and threshold configured by the user. If the result is a
failure, the corresponding operation, alarm or interruption will be performed
according to the failure policy configured by the user.## 1.2 注意事项
+
+Add config : common.properties
+> data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
+
+Please fill in `data-quality.jar.name` according to the actual package name,
+If you package `data-quality` separately, remember to modify the package name
to be consistent with `data-quality.jar.name`.
+If the old version is upgraded and used, you need to execute the `sql` update
script to initialize the database before running.
+If you want to use `MySQL` data, you need to comment out the `scope` of
`MySQL` in `pom.xml`
+Currently only `MySQL`, `PostgreSQL` and `HIVE` data sources have been tested,
other data sources have not been tested yet
+`Spark` needs to be configured to read `Hive` metadata, `Spark` does not use
`jdbc` to read `Hive`
+
+## 1.3 Detail
+
+- CheckMethod: [CheckFormula][Operator][Threshold], if the result is true, it
indicates that the data does not meet expectations, and the failure strategy is
executed.
+- CheckFormula:
+ - Expected-Actual
+ - Actual-Expected
+ - (Actual/Expected)x100%
+ - (Expected-Actual)/Expected x100%
+- Operator:=、>、>=、<、<=、!=
+- ExpectedValue
+ - FixValue
+ - DailyAvg
+ - WeeklyAvg
+ - MonthlyAvg
+ - Last7DayAvg
+ - Last30DayAvg
+ - SrcTableTotalRows
+ - TargetTableTotalRows
+
+- eg
+ - CheckFormula:Expected-Actual
+ - Operator:>
+ - Threshold:0
+ - ExpectedValue:FixValue=9。
+
+Assuming that the actual value is 10, the operator is >, and the expected
value is 9, then the result 10 -9 > 0 is true, which means that the row data in
the empty column has exceeded the threshold, and the task is judged to fail
+# 2 Guide
+## 2.1 NullCheck
+### 2.1.1 Introduction
+The goal of the null value check is to check the number of empty rows in the
specified column. The number of empty rows can be compared with the total
number of rows or a specified threshold. If it is greater than a certain
threshold, it will be judged as failure.
+- Calculate the SQL statement that the specified column is empty as follows:
+ - SELECT COUNT(*) AS miss FROM ${src_table} WHERE (${src_field} is null or
${src_field} = '') AND (${src_filter})
Review Comment:
Should use SQL syntax highlighting here (a ```sql fenced block).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]