This is an automated email from the ASF dual-hosted git repository.
nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 9f5e8cc Fixing README for hudi test suite long running job (#2578)
9f5e8cc is described below
commit 9f5e8cc7c3789fd3658f52f960fb5cbc8e4efce9
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Thu Feb 25 19:50:18 2021 -0500
Fixing README for hudi test suite long running job (#2578)
---
hudi-integ-test/README.md | 92 +++++++++++++++++++++++++++++------------------
1 file changed, 58 insertions(+), 34 deletions(-)
diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md
index ff64ed1..06de263 100644
--- a/hudi-integ-test/README.md
+++ b/hudi-integ-test/README.md
@@ -270,20 +270,31 @@ spark-submit \
--compact-scheduling-minshare 1
```
-For long running test suite, validation has to be done differently. Idea is to run same dag in a repeated manner.
-Hence "ValidateDatasetNode" is introduced which will read entire input data and compare it with hudi contents both via
-spark datasource and hive table via spark sql engine.
+## Running the long running test suite in a local Docker environment
+
+For the long running test suite, validation has to be done differently. The idea is to run the same dag repeatedly for
+N iterations. Hence "ValidateDatasetNode" is introduced, which reads the entire input data and compares it with hudi
+contents via both the spark datasource and the hive table through the spark sql engine. Hive validation is configurable.
If you have "ValidateDatasetNode" in your dag, do not replace hive jars as instructed above. Spark sql engine does not
-go well w/ hive2* jars. So, after running docker setup, just copy test.properties and your dag of interest and you are
-good to go ahead.
+go well w/ hive2* jars. So, after running docker setup, follow the steps below.
+```
+docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt/
+docker cp demo/config/test-suite/test.properties adhoc-2:/opt/
+```
+Also copy your dag of interest to adhoc-2:/opt/
+```
+docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/
+```
For repeated runs, two additional configs need to be set: "dag_rounds" and "dag_intermittent_delay_mins".
-This means that your dag will be repeated for N times w/ a delay of Y mins between each round.
+This means that your dag will be repeated N times w/ a delay of Y mins between each round. Note: complex-dag-cow.yaml
+already has all these configs set, so no changes are required just to try it out.
+
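As a sketch, these two configs are plain top-level entries in the dag yaml (values here mirror the 10-round example discussed further down; the rest of the dag definition is elided):
```
dag_rounds: 10
dag_intermittent_delay_mins: 10
```
With these values, the dag body is executed 10 times with a 10 minute pause between consecutive rounds.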
+Also, ValidateDatasetNode can be configured in two ways: either with "delete_input_data" set to true, or without
+setting the config. When "delete_input_data" is set for ValidateDatasetNode, the entire input data will be deleted
+once validation is complete. So, the suggestion is to use ValidateDatasetNode as the last node in the dag with "delete_input_data".
-Also, ValidateDatasetNode can be configured in two ways. Either with "delete_input_data: true" set or not set.
-When "delete_input_data" is set for ValidateDatasetNode, once validation is complete, entire input data will be deleted.
-So, suggestion is to use this ValidateDatasetNode as the last node in the dag with "delete_input_data".
Example dag:
```
Insert
@@ -294,7 +305,7 @@ Example dag:
If the above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 10, then this dag will run 10 times
with a 10 min delay between every run. At the end of every run, records written as part of this round will be validated.
At the end of each validation, all contents of input are deleted.
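As a quick sanity check on the arithmetic above, a lower bound on total wall-clock time for such a repeated run can be sketched (illustrative only; `round_mins` is a hypothetical per-round duration, not a test suite config):

```python
# Illustrative helper, not part of the hudi test suite: estimates a lower
# bound on wall-clock minutes for a repeated dag run.
def total_runtime_mins(dag_rounds, delay_mins, round_mins=0):
    # dag_rounds rounds of work, with (dag_rounds - 1) delays between them.
    return dag_rounds * round_mins + (dag_rounds - 1) * delay_mins

# 10 rounds with 10 min delays adds at least 90 minutes of idle time alone.
print(total_runtime_mins(10, 10))  # 90
```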
-For eg: incase of above dag,
+To illustrate each round:
```
Round1:
insert => inputPath/batch1
@@ -323,6 +334,12 @@ every cycle.
Let's see an example where you don't set "delete_input_data" as part of Validation.
```
+ Insert
+ Upsert
+ ValidateDatasetNode
+```
+Here is the illustration of each round:
+```
Round1:
insert => inputPath/batch1
upsert -> inputPath/batch2
@@ -382,27 +399,14 @@ Above dag was just an example for illustration purposes. But you can make it com
Validate w/o deleting
Upsert
Validate w/ deletion
-```
-With this dag, you can set the two additional configs "dag_rounds" and "dag_intermittent_delay_mins" and have a long
-running test suite.
-
```
-dag_rounds: 1
-dag_intermittent_delay_mins: 10
-dag_content:
- Insert
- Upsert
- Delete
- Validate w/o deleting
- Insert
- Rollback
- Validate w/o deleting
- Upsert
- Validate w/ deletion
+Once you have copied the jar, test.properties and your dag to adhoc-2:/opt/, run the following command to get a shell
+inside the container, from which you can launch the test suite job.
```
-
-Sample COW command with repeated runs.
+docker exec -it adhoc-2 /bin/bash
+```
+Sample COW command
```
spark-submit \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
@@ -424,25 +428,45 @@ spark-submit \
--conf spark.driver.extraClassPath=/var/demo/jars/* \
--conf spark.executor.extraClassPath=/var/demo/jars/* \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
-/opt/hudi-integ-test-bundle-0.6.1-SNAPSHOT.jar \
+/opt/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar \
--source-ordering-field test_suite_source_ordering_field \
--use-deltastreamer \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-table table1 \
--props test.properties \
---schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+--schemaprovider-class org.apache.hudi.integ.testsuite.schema.TestSuiteFileBasedSchemaProvider \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--input-file-size 125829120 \
---workload-yaml-path file:/var/hoodie/ws/docker/demo/config/test-suite/complex-dag-cow.yaml \
+--workload-yaml-path file:/opt/complex-dag-cow.yaml \
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.WorkflowDagGenerator \
--table-type COPY_ON_WRITE \
---compact-scheduling-minshare 1
+--compact-scheduling-minshare 1 \
+--clean-input \
+--clean-output
```
-A ready to use dag is available under docker/demo/config/test-suite/ that could give you an idea for long running
+A few ready to use dags are available under docker/demo/config/test-suite/ that could give you an idea for long running
dags.
-cow-per-round-mixed-validate.yaml
+```
+complex-dag-cow.yaml: simple 1 round dag for a COW table.
+complex-dag-mor.yaml: simple 1 round dag for a MOR table.
+cow-clustering-example.yaml: dag with 3 rounds, in which inline clustering triggers during the 2nd iteration.
+cow-long-running-example.yaml: long running dag with 50 iterations. Only 1 partition is used.
+cow-long-running-multi-partitions.yaml: long running dag with 50 iterations and multiple partitions.
+```
+
+To run test suite jobs for a MOR table, pretty much any of these dags can be used as is. The only change is in the
+spark-submit command, where you need to set the table type.
+```
+--table-type MERGE_ON_READ
+```
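For reference, the two table-type values used by the sample commands in this README are:
```
--table-type COPY_ON_WRITE    # COW table, as in the sample COW command above
--table-type MERGE_ON_READ    # MOR table
```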
+But if you have to switch from one table type to the other, ensure you clean up all test paths explicitly before
+switching to a different table type.
+```
+hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/output/
+hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/input/
+```
As of now, "ValidateDatasetNode" uses spark data source and hive tables for comparison. Hence COW and real time view in
MOR can be tested.