This is an automated email from the ASF dual-hosted git repository.
nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 9f5e8cc Fixing README for hudi test suite long running job (#2578)
9f5e8cc is described below
commit 9f5e8cc7c3789fd3658f52f960fb5cbc8e4efce9
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Thu Feb 25 19:50:18 2021 -0500
Fixing README for hudi test suite long running job (#2578)
---
hudi-integ-test/README.md | 92 +++++++++++++++++++++++++++++------------------
1 file changed, 58 insertions(+), 34 deletions(-)
diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md
index ff64ed1..06de263 100644
--- a/hudi-integ-test/README.md
+++ b/hudi-integ-test/README.md
@@ -270,20 +270,31 @@ spark-submit \
--compact-scheduling-minshare 1
```
-For long running test suite, validation has to be done differently. Idea is to run same dag in a repeated manner.
-Hence "ValidateDatasetNode" is introduced which will read entire input data and compare it with hudi contents both via
-spark datasource and hive table via spark sql engine.
+## Running the long running test suite in a local Docker environment
+
+For the long running test suite, validation has to be done differently. The idea is to run the same dag repeatedly for
+N iterations. Hence "ValidateDatasetNode" is introduced, which reads the entire input data and compares it with hudi
+contents via both the spark datasource and the hive table through the spark sql engine. Hive validation is configurable.
If you have "ValidateDatasetNode" in your dag, do not replace hive jars as instructed above. Spark sql engine does not
-go well w/ hive2* jars. So, after running docker setup, just copy test.properties and your dag of interest and you are
-good to go ahead.
+go well w/ hive2* jars. So, after running docker setup, follow the steps below.
+```
+docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt/
+docker cp demo/config/test-suite/test.properties adhoc-2:/opt/
+```
+Also copy your dag of interest to adhoc-2:/opt/
+```
+docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
+```
For repeated runs, two additional configs need to be set: "dag_rounds" and "dag_intermittent_delay_mins".
-This means that your dag will be repeated for N times w/ a delay of Y mins between each round.
+This means that your dag will be repeated N times w/ a delay of Y mins between each round. Note: complex-dag-cow.yaml
+already has all these configs set, so no changes are required just to try it out.
+
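As a sketch, these two configs are plain top-level entries in the dag yaml (values here mirror the 10-round example discussed further down; the rest of the dag definition is elided):
```
dag_rounds: 10
dag_intermittent_delay_mins: 10
```
With these values, the dag body is executed 10 times with a 10 minute pause between consecutive rounds.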
+Also, ValidateDatasetNode can be configured in two ways: either with "delete_input_data" set to true, or without
+setting the config. When "delete_input_data" is set for ValidateDatasetNode, the entire input data will be deleted
+once validation is complete. So, the suggestion is to use ValidateDatasetNode as the last node in the dag with "delete_input_data".
-Also, ValidateDatasetNode can be configured in two ways. Either with "delete_input_data: true" set or not set.
-When "delete_input_data" is set for ValidateDatasetNode, once validation is complete, entire input data will be deleted.
-So, suggestion is to use this ValidateDatasetNode as the last node in the dag with "delete_input_data".
Example dag:
```
Insert
@@ -294,7 +305,7 @@ Example dag:
If the above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 10, then this dag will run 10 times
with a 10 min delay between every run. At the end of every run, records written as part of this round will be validated.
At the end of each validation, all contents of input are deleted.
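As a quick sanity check on the arithmetic above, a lower bound on total wall-clock time for such a repeated run can be sketched (illustrative only; `round_mins` is a hypothetical per-round duration, not a test suite config):

```python
# Illustrative helper, not part of the hudi test suite: estimates a lower
# bound on wall-clock minutes for a repeated dag run.
def total_runtime_mins(dag_rounds, delay_mins, round_mins=0):
    # dag_rounds rounds of work, with (dag_rounds - 1) delays between them.
    return dag_rounds * round_mins + (dag_rounds - 1) * delay_mins

# 10 rounds with 10 min delays adds at least 90 minutes of idle time alone.
print(total_runtime_mins(10, 10))  # 90
```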
-For eg: incase of above dag,
+To illustrate each round:
```
Round1:
insert => inputPath/batch1
@@ -323,6 +334,12 @@ every cycle.
Let's see an example where you don't set "delete_input_data" as part of Validation.
```
+ Insert
+ Upsert
+ ValidateDatasetNode
+```
+Here is the illustration of each round:
+```
Round1:
insert => inputPath/batch1
upsert -> inputPath/batch2
@@ -382,27 +399,14 @@ Above dag was just an example for illustration purposes. But you can make it com
Validate w/o deleting
Upsert
Validate w/ deletion
-```
-With this dag, you can set the two additional configs "dag_rounds" and "dag_intermittent_delay_mins" and have a long
-running test suite.
-
```
-dag_rounds: 1
-dag_intermittent_delay_mins: 10
-dag_content:
- Insert
- Upsert
- Delete
- Validate w/o deleting
- Insert
- Rollback
- Validate w/o deleting
- Upsert
- Validate w/ deletion
+Once you have copied the jar, test.properties and your dag to adhoc-2:/opt/, run the following command to get a shell
+inside the container, from which you can launch the test suite job.
```
-
-Sample COW command with repeated runs.
+docker exec -it adhoc-2 /bin/bash
+```
+Sample COW command
```
spark-submit \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
@@ -424,25 +428,45 @@ spark-submit \
--conf spark.driver.extraClassPath=/var/demo/jars/* \
--conf spark.executor.extraClassPath=/var/demo/jars/* \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.6.1-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-table table1 \
--props test.properties \
---schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
---workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-cow.yaml \
+--workload-yaml-path file:/opt/complex-dag-cow.yaml \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
---compact-scheduling-minshare 1
+--compact-scheduling-minshare 1 \
+--clean-input \
+--clean-output
```
-A ready to use dag is available under docker/demo/config/test-suite/ that could give you an idea for long running
+A few ready to use dags are available under docker/demo/config/test-suite/ that could give you an idea for long running
dags.
-cow-per-round-mixed-validate.yaml
+```
+complex-dag-cow.yaml: simple 1 round dag for a COW table.
+complex-dag-mor.yaml: simple 1 round dag for a MOR table.
+cow-clustering-example.yaml: dag with 3 rounds, in which inline clustering triggers during the 2nd iteration.
+cow-long-running-example.yaml: long running dag with 50 iterations. Only 1 partition is used.
+cow-long-running-multi-partitions.yaml: long running dag with 50 iterations and multiple partitions.
+```
+
+To run test suite jobs for a MOR table, pretty much any of these dags can be used as is. The only change is in the
+spark-submit command, where you need to set the table type.
+```
+--table-type MERGE_ON_READ
+```
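For reference, the two table-type values used by the sample commands in this README are:
```
--table-type COPY_ON_WRITE    # COW table, as in the sample COW command above
--table-type MERGE_ON_READ    # MOR table
```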
+But if you have to switch from one table type to the other, ensure you clean up all test paths explicitly before
+switching to a different table type.
+```
+hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/output/
+hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/input/
+```
As of now, "ValidateDatasetNode" uses spark data source and hive tables for comparison. Hence COW and real time view in
MOR can be tested.