This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 7141e68c64a [DOCS] Update data quality docs (#8552)
7141e68c64a is described below

commit 7141e68c64a996cb87085d09f4eeb32f3f0df08e
Author: Andy Walner <[email protected]>
AuthorDate: Fri Jun 23 16:33:30 2023 -0400

    [DOCS] Update data quality docs (#8552)
    
    Co-authored-by: Y Ethan Guo <[email protected]>
---
 website/docs/precommit_validator.md                | 52 +++++++++++++++-------
 .../version-0.13.0/precommit_validator.md          | 52 +++++++++++++++-------
 .../version-0.13.1/precommit_validator.md          | 52 +++++++++++++++-------
 3 files changed, 111 insertions(+), 45 deletions(-)

diff --git a/website/docs/precommit_validator.md 
b/website/docs/precommit_validator.md
index c311c0b01b8..f7466002f89 100644
--- a/website/docs/precommit_validator.md
+++ b/website/docs/precommit_validator.md
@@ -3,8 +3,9 @@ title: Data Quality
 keywords: [ hudi, quality, expectations, pre-commit validator]
 ---
 
-Apache Hudi has what are called **Pre-Commit Validators** that allow you to 
validate that your data meets certain data quality
-expectations as you are writing with DeltaStreamer or Spark Datasource writers.
+Data quality refers to the overall accuracy, completeness, consistency, and 
validity of data. Ensuring data quality is vital for accurate analysis and 
reporting, as well as for compliance with regulations and maintaining trust in 
your organization's data infrastructure.
+
+Hudi offers **Pre-Commit Validators** that allow you to ensure that your data 
meets certain data quality expectations as you are writing with DeltaStreamer 
or Spark Datasource writers.
 
 To configure pre-commit validators, use this setting 
`hoodie.precommit.validators=<comma separated list of validator class names>`.
 
@@ -17,49 +18,65 @@ spark.write.format("hudi")
 Today you can use any of these validators and even have the flexibility to 
extend your own:
 
 ## SQL Query Single Result
-Can be used to validate that a query on the table results in a specific value.
-- 
[org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java)
+
+The SQL Query Single Result validator can be used to validate that a query on 
the table results in a specific value. This validator allows you to run a SQL 
query and abort the commit if it does not match the expected output.
 
-Multiple queries separated by ';' delimiter are supported.Expected result is 
included as part of query separated by '#'. Example query: 
`query1#result1;query2#result2`
+Multiple queries can be separated by the `;` delimiter. Include the expected 
result as part of the query, separated by `#`.
 
-Example, "expect exactly 0 null rows":
+Syntax: `query1#result1;query2#result2`
+
+Example:
 ```scala
+// In this example, we set up a validator that expects there is no row with 
`col` column as `null`
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
-  option("hoodie.precommit.validators.single.value.sql.queries", "select 
count(*) from <TABLE_NAME> where col=null#0").
+  option("hoodie.precommit.validators.single.value.sql.queries", "select 
count(*) from <TABLE_NAME> where col is null#0").
   save(basePath)
 ```
 
 ## SQL Query Equality
-Can be used to validate for equality of rows before and after the commit.
-- 
[org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java)
+
+The SQL Query Equality validator runs a query before ingesting the data, then 
runs the same query after ingesting the data and confirms that both outputs 
match. This allows you to validate for equality of rows before and after the 
commit.
+
+This validator is useful when you want to verify that your query does not 
change a specific subset of the data. Some examples:
+- Validate that the number of null fields is the same before and after your 
query
+- Validate that there are no duplicate records after your query runs
+- Validate that you are only updating the data, and no inserts slip through
 
-Example, "expect no change of null rows with this commit":
+Example:
 ```scala
+// In this example, we set up a validator that expects no change of null rows 
with the new commit
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator").
-  option("hoodie.precommit.validators.equality.sql.queries", "select count(*) 
from <TABLE_NAME> where col=null").
+  option("hoodie.precommit.validators.equality.sql.queries", "select count(*) 
from <TABLE_NAME> where col is null").
   save(basePath)
 ```
 
 ## SQL Query Inequality
-Can be used to validate for inequality of rows before and after the commit.
-- 
[org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java)
 
-Example, "expect there must be a change of null rows with this commit":
+The SQL Query Inequality validator runs a query before ingesting the data, then 
runs the same query after ingesting the data and confirms that both outputs DO 
NOT match. This allows you to confirm changes in the rows before and after the 
commit.
+
+Example:
 ```scala
+// In this example, we set up a validator that expects a change of null rows 
with the new commit
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator").
-  option("hoodie.precommit.validators.inequality.sql.queries", "select 
count(*) from <TABLE_NAME> where col=null").
+  option("hoodie.precommit.validators.inequality.sql.queries", "select 
count(*) from <TABLE_NAME> where col is null").
   save(basePath)
 ```
 
@@ -72,3 +89,8 @@ void validateRecordsBeforeAndAfter(Dataset<Row> before,
                                    Dataset<Row> after, 
                                    Set<String> partitionsAffected)
 ```
+
+## Additional Monitoring with Notifications
+Hudi offers a [commit notification 
service](https://hudi.apache.org/docs/next/writing_data/#commit-notifications) 
that can be configured to trigger notifications about write commits.
+
+The commit notification service can be combined with pre-commit validators to 
send a notification when a commit fails a validation. This is possible by 
passing details about the validation as a custom value to the HTTP endpoint.
diff --git a/website/versioned_docs/version-0.13.0/precommit_validator.md 
b/website/versioned_docs/version-0.13.0/precommit_validator.md
index c311c0b01b8..f7466002f89 100644
--- a/website/versioned_docs/version-0.13.0/precommit_validator.md
+++ b/website/versioned_docs/version-0.13.0/precommit_validator.md
@@ -3,8 +3,9 @@ title: Data Quality
 keywords: [ hudi, quality, expectations, pre-commit validator]
 ---
 
-Apache Hudi has what are called **Pre-Commit Validators** that allow you to 
validate that your data meets certain data quality
-expectations as you are writing with DeltaStreamer or Spark Datasource writers.
+Data quality refers to the overall accuracy, completeness, consistency, and 
validity of data. Ensuring data quality is vital for accurate analysis and 
reporting, as well as for compliance with regulations and maintaining trust in 
your organization's data infrastructure.
+
+Hudi offers **Pre-Commit Validators** that allow you to ensure that your data 
meets certain data quality expectations as you are writing with DeltaStreamer 
or Spark Datasource writers.
 
 To configure pre-commit validators, use this setting 
`hoodie.precommit.validators=<comma separated list of validator class names>`.
 
@@ -17,49 +18,65 @@ spark.write.format("hudi")
 Today you can use any of these validators and even have the flexibility to 
extend your own:
 
 ## SQL Query Single Result
-Can be used to validate that a query on the table results in a specific value.
-- 
[org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java)
+
+The SQL Query Single Result validator can be used to validate that a query on 
the table results in a specific value. This validator allows you to run a SQL 
query and abort the commit if it does not match the expected output.
 
-Multiple queries separated by ';' delimiter are supported.Expected result is 
included as part of query separated by '#'. Example query: 
`query1#result1;query2#result2`
+Multiple queries can be separated by the `;` delimiter. Include the expected 
result as part of the query, separated by `#`.
 
-Example, "expect exactly 0 null rows":
+Syntax: `query1#result1;query2#result2`
+
+Example:
 ```scala
+// In this example, we set up a validator that expects there is no row with 
`col` column as `null`
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
-  option("hoodie.precommit.validators.single.value.sql.queries", "select 
count(*) from <TABLE_NAME> where col=null#0").
+  option("hoodie.precommit.validators.single.value.sql.queries", "select 
count(*) from <TABLE_NAME> where col is null#0").
   save(basePath)
 ```
 
 ## SQL Query Equality
-Can be used to validate for equality of rows before and after the commit.
-- 
[org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java)
+
+The SQL Query Equality validator runs a query before ingesting the data, then 
runs the same query after ingesting the data and confirms that both outputs 
match. This allows you to validate for equality of rows before and after the 
commit.
+
+This validator is useful when you want to verify that your query does not 
change a specific subset of the data. Some examples:
+- Validate that the number of null fields is the same before and after your 
query
+- Validate that there are no duplicate records after your query runs
+- Validate that you are only updating the data, and no inserts slip through
 
-Example, "expect no change of null rows with this commit":
+Example:
 ```scala
+// In this example, we set up a validator that expects no change of null rows 
with the new commit
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator").
-  option("hoodie.precommit.validators.equality.sql.queries", "select count(*) 
from <TABLE_NAME> where col=null").
+  option("hoodie.precommit.validators.equality.sql.queries", "select count(*) 
from <TABLE_NAME> where col is null").
   save(basePath)
 ```
 
 ## SQL Query Inequality
-Can be used to validate for inequality of rows before and after the commit.
-- 
[org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java)
 
-Example, "expect there must be a change of null rows with this commit":
+The SQL Query Inequality validator runs a query before ingesting the data, then 
runs the same query after ingesting the data and confirms that both outputs DO 
NOT match. This allows you to confirm changes in the rows before and after the 
commit.
+
+Example:
 ```scala
+// In this example, we set up a validator that expects a change of null rows 
with the new commit
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator").
-  option("hoodie.precommit.validators.inequality.sql.queries", "select 
count(*) from <TABLE_NAME> where col=null").
+  option("hoodie.precommit.validators.inequality.sql.queries", "select 
count(*) from <TABLE_NAME> where col is null").
   save(basePath)
 ```
 
@@ -72,3 +89,8 @@ void validateRecordsBeforeAndAfter(Dataset<Row> before,
                                    Dataset<Row> after, 
                                    Set<String> partitionsAffected)
 ```
+
+## Additional Monitoring with Notifications
+Hudi offers a [commit notification 
service](https://hudi.apache.org/docs/next/writing_data/#commit-notifications) 
that can be configured to trigger notifications about write commits.
+
+The commit notification service can be combined with pre-commit validators to 
send a notification when a commit fails a validation. This is possible by 
passing details about the validation as a custom value to the HTTP endpoint.
diff --git a/website/versioned_docs/version-0.13.1/precommit_validator.md 
b/website/versioned_docs/version-0.13.1/precommit_validator.md
index c311c0b01b8..eb96e648853 100644
--- a/website/versioned_docs/version-0.13.1/precommit_validator.md
+++ b/website/versioned_docs/version-0.13.1/precommit_validator.md
@@ -3,8 +3,9 @@ title: Data Quality
 keywords: [ hudi, quality, expectations, pre-commit validator]
 ---
 
-Apache Hudi has what are called **Pre-Commit Validators** that allow you to 
validate that your data meets certain data quality
-expectations as you are writing with DeltaStreamer or Spark Datasource writers.
+Data quality refers to the overall accuracy, completeness, consistency, and 
validity of data. Ensuring data quality is vital for accurate analysis and 
reporting, as well as for compliance with regulations and maintaining trust in 
your organization's data infrastructure.
+
+Hudi offers **Pre-Commit Validators** that allow you to ensure that your data 
meets certain data quality expectations as you are writing with DeltaStreamer 
or Spark Datasource writers.
 
 To configure pre-commit validators, use this setting 
`hoodie.precommit.validators=<comma separated list of validator class names>`.
 
@@ -17,49 +18,65 @@ spark.write.format("hudi")
 Today you can use any of these validators and even have the flexibility to 
extend your own:
 
 ## SQL Query Single Result
-Can be used to validate that a query on the table results in a specific value.
-- 
[org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java)
+
+The SQL Query Single Result validator can be used to validate that a query on 
the table results in a specific value. This validator allows you to run a SQL 
query and abort the commit if it does not match the expected output.
 
-Multiple queries separated by ';' delimiter are supported.Expected result is 
included as part of query separated by '#'. Example query: 
`query1#result1;query2#result2`
+Multiple queries can be separated by the `;` delimiter. Include the expected 
result as part of the query, separated by `#`.
 
-Example, "expect exactly 0 null rows":
+Syntax: `query1#result1;query2#result2`
+
+Example:
 ```scala
+// In this example, we set up a validator that expects there is no row with 
`col` column as `null`
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
-  option("hoodie.precommit.validators.single.value.sql.queries", "select 
count(*) from <TABLE_NAME> where col=null#0").
+  option("hoodie.precommit.validators.single.value.sql.queries", "select 
count(*) from <TABLE_NAME> where col is null#0").
   save(basePath)
 ```
 
 ## SQL Query Equality
-Can be used to validate for equality of rows before and after the commit.
-- 
[org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java)
+
+The SQL Query Equality validator runs a query before ingesting the data, then 
runs the same query after ingesting the data and confirms that both outputs 
match. This allows you to validate for equality of rows before and after the 
commit.
+
+This validator is useful when you want to verify that your query does not 
change a specific subset of the data. Some examples:
+- Validate that the number of null fields is the same before and after your 
query
+- Validate that there are no duplicate records after your query runs
+- Validate that you are only updating the data, and no inserts slip through
 
-Example, "expect no change of null rows with this commit":
+Example:
 ```scala
+// In this example, we set up a validator that expects no change of null rows 
with the new commit
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator").
-  option("hoodie.precommit.validators.equality.sql.queries", "select count(*) 
from <TABLE_NAME> where col=null").
+  option("hoodie.precommit.validators.equality.sql.queries", "select count(*) 
from <TABLE_NAME> where col is null").
   save(basePath)
 ```
 
 ## SQL Query Inequality
-Can be used to validate for inequality of rows before and after the commit.
-- 
[org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java)
+[org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator](https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java)
 
-Example, "expect there must be a change of null rows with this commit":
+The SQL Query Inequality validator runs a query before ingesting the data, then 
runs the same query after ingesting the data and confirms that both outputs DO 
NOT match. This allows you to confirm changes in the rows before and after the 
commit.
+
+Example:
 ```scala
+// In this example, we set up a validator that expects a change of null rows 
with the new commit
+
 import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
 
 df.write.format("hudi").mode(Overwrite).
   option(TABLE_NAME, tableName).
   option("hoodie.precommit.validators", 
"org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator").
-  option("hoodie.precommit.validators.inequality.sql.queries", "select 
count(*) from <TABLE_NAME> where col=null").
+  option("hoodie.precommit.validators.inequality.sql.queries", "select 
count(*) from <TABLE_NAME> where col is null").
   save(basePath)
 ```
 
@@ -72,3 +89,8 @@ void validateRecordsBeforeAndAfter(Dataset<Row> before,
                                    Dataset<Row> after, 
                                    Set<String> partitionsAffected)
 ```
+
+## Additional Monitoring with Notifications
+Hudi offers a [commit notification 
service](https://hudi.apache.org/docs/next/writing_data/#commit-notifications) 
that can be configured to trigger notifications about write commits.
+
+The commit notification service can be combined with pre-commit validators to 
send a notification when a commit fails a validation. This is possible by 
passing details about the validation as a custom value to the HTTP endpoint.
\ No newline at end of file
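
A note on the `query1#result1;query2#result2` syntax documented in the patch above: the sketch below shows one way to assemble the value for `hoodie.precommit.validators.single.value.sql.queries` from several query/expected-result pairs. It is not part of Hudi. `SingleResultQueries`, `Expectation`, `render`, and the `id` column are hypothetical names used only for illustration, and the check that rejects `;` and `#` inside a query assumes the delimiters cannot be escaped, which the docs do not state.

```scala
// Hypothetical helper, not a Hudi API: builds the option value in the
// documented form query1#result1;query2#result2.
object SingleResultQueries {

  /** One validation: a SQL query and the single value it is expected to return. */
  final case class Expectation(query: String, expected: String)

  /** Joins each query to its expected result with '#', and the pairs with ';'. */
  def render(expectations: Seq[Expectation]): String = {
    expectations.foreach { e =>
      // Assumption: the delimiters cannot be escaped, so reject them up front.
      require(!e.query.exists(c => c == ';' || c == '#'),
        s"query must not contain ';' or '#': ${e.query}")
      require(!e.expected.exists(c => c == ';' || c == '#'),
        s"expected result must not contain ';' or '#': ${e.expected}")
    }
    expectations.map(e => s"${e.query}#${e.expected}").mkString(";")
  }

  def main(args: Array[String]): Unit = {
    val value = render(Seq(
      Expectation("select count(*) from <TABLE_NAME> where col is null", "0"),
      Expectation("select count(*) from <TABLE_NAME> where id is null", "0")
    ))
    // Prints both checks joined by ';', each query followed by '#0'.
    println(value)
  }
}
```

The resulting string can then be passed as the value of `option("hoodie.precommit.validators.single.value.sql.queries", ...)` in the writer examples from the patch.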
