This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new b148a8dbbbb [HUDI-8296] Improve docs around Hudi Spark support and hudi-utilities-slim-bundle (#12478)
b148a8dbbbb is described below
commit b148a8dbbbba7bea8ae3f44dec0d5c07c898add3
Author: Y Ethan Guo <[email protected]>
AuthorDate: Thu Dec 12 17:45:24 2024 -0800
[HUDI-8296] Improve docs around Hudi Spark support and hudi-utilities-slim-bundle (#12478)

* [HUDI-8296] Improve docs around Hudi Spark support and hudi-utilities-slim-bundle
* Fix one word
---
website/docs/cleaning.md | 17 +++++---
website/docs/cli.md | 3 +-
website/docs/clustering.md | 6 ++-
website/docs/compaction.md | 4 +-
website/docs/concurrency_control.md | 6 ++-
website/docs/deployment.md | 15 +++----
website/docs/gcp_bigquery.md | 4 +-
website/docs/hoodie_streaming_ingestion.md | 27 ++++++------
website/docs/metadata_indexing.md | 12 ++++--
website/docs/migration_guide.md | 3 +-
website/docs/querying_data.md | 2 +-
website/docs/quick-start-guide.md | 67 ++++--------------------------
website/docs/snapshot_exporter.md | 20 ++++-----
website/docs/syncing_datahub.md | 7 ++--
14 files changed, 78 insertions(+), 115 deletions(-)
diff --git a/website/docs/cleaning.md b/website/docs/cleaning.md
index c050604c6e9..5f6ea4b3697 100644
--- a/website/docs/cleaning.md
+++ b/website/docs/cleaning.md
@@ -79,7 +79,9 @@ For Flink based writing, this is the default mode of cleaning. Please refer to [
#### Run independently
Hoodie Cleaner can also be run as a separate process. Following is the command for running the cleaner independently:
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--help, -h
@@ -101,7 +103,9 @@ spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls
Some examples to run the cleaner.
Keep the latest 10 commits
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--target-base-path /path/to/hoodie_table \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
--hoodie-conf hoodie.cleaner.commits.retained=10 \
@@ -109,15 +113,19 @@ spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls
```
Keep the latest 3 file versions
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
-  --target-base-path /path/to/hoodie_table \
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
--hoodie-conf hoodie.cleaner.fileversions.retained=3 \
--hoodie-conf hoodie.cleaner.parallelism=200
```
Clean commits older than 24 hours
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--target-base-path /path/to/hoodie_table \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
--hoodie-conf hoodie.cleaner.hours.retained=24 \
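
As a side note for readers trying these cleaner commands: when the bundles are already built locally (e.g., via `mvn package`), the same job can be launched from the build outputs with `--jars` instead of resolving `--packages` from Maven. A minimal sketch, assuming jar paths from a local 1.0.0 build:

```shell
# Sketch: HoodieCleaner from locally built bundles (jar paths assume a local 1.0.0 build)
spark-submit --master local \
  --jars packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar \
  --class org.apache.hudi.utilities.HoodieCleaner \
  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
  --target-base-path /path/to/hoodie_table \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.cleaner.commits.retained=10
```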
diff --git a/website/docs/cli.md b/website/docs/cli.md
index 7cc4cdd92b0..def32b11a8e 100644
--- a/website/docs/cli.md
+++ b/website/docs/cli.md
@@ -8,8 +8,7 @@ last_modified_at: 2021-08-18T15:59:57-04:00
Once hudi has been built, the shell can be fired via `cd hudi-cli && ./hudi-cli.sh`.
### Hudi CLI Bundle setup
-In release `0.13.0` we have now added another way of launching the `hudi cli`, which is using the `hudi-cli-bundle`. (Note this is only supported for Spark3,
-for Spark2 please see the above Local setup section)
+In release `0.13.0` we have now added another way of launching the `hudi cli`, which is using the `hudi-cli-bundle`.
There are a couple of requirements when using this approach such as having `spark` installed locally on your machine.
It is required to use a spark distribution with hadoop dependencies packaged such as `spark-3.3.1-bin-hadoop2.tgz` from https://archive.apache.org/dist/spark/.
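
For orientation, launching via the CLI bundle generally means pointing a local Spark at the CLI and Spark bundle jars. A sketch under stated assumptions — the script and environment-variable names below are illustrative, not taken from this commit, so consult the cli.md page for the exact steps:

```shell
# Sketch of a hudi-cli-bundle launch; script and variable names are illustrative
export SPARK_HOME=/path/to/spark-3.3.1-bin-hadoop2    # Spark distro with Hadoop deps, per the doc
export CLI_BUNDLE_JAR=packaging/hudi-cli-bundle/target/hudi-cli-bundle_2.12-1.0.0.jar
export SPARK_BUNDLE_JAR=packaging/hudi-spark-bundle/target/hudi-spark3.3-bundle_2.12-1.0.0.jar
cd packaging/hudi-cli-bundle && ./hudi-cli-with-bundle.sh
```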
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 80a8717e177..0bbbad9781a 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -243,8 +243,9 @@ A sample spark-submit command to setup HoodieClusteringJob is as below:
```bash
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieClusteringJob \
-/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
+/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /path/to/config/clusteringjob.properties \
--mode scheduleAndExecute \
--base-path /path/to/hudi_table/basePath \
@@ -272,8 +273,9 @@ A sample spark-submit command to setup HoodieStreamer is as below:
```bash
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
-/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
+/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /path/to/config/clustering_kafka.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
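
The `clusteringjob.properties` file passed via `--props` above is user-authored. A minimal sketch of plausible contents, using standard Hudi clustering configs (the values are illustrative):

```shell
# Sketch: a minimal clusteringjob.properties (keys are standard clustering configs; values illustrative)
cat > /path/to/config/clusteringjob.properties <<'EOF'
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.plan.strategy.sort.columns=column1,column2
EOF
```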
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index de5bd20a0e1..7859030052a 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -150,7 +150,7 @@ ingests data to Hudi table continuously from upstream sources. In this mode, Hud
compactions. Here is an example snippet for running in continuous mode with async compactions
```properties
-spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+spark-submit --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
--table-type MERGE_ON_READ \
--target-base-path <hudi_base_path> \
@@ -187,7 +187,7 @@ The compactor utility allows to do scheduling and execution of compaction.
Example:
```properties
-spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+spark-submit --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--class org.apache.hudi.utilities.HoodieCompactor \
--base-path <base_path> \
--table-name <table_name> \
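
Note that each rewritten command pins the utilities-slim bundle and the Spark bundle to the same Hudi release. One way to keep the pair in lockstep across jobs is to derive both coordinates from a single pinned version — a sketch, with illustrative variable names:

```shell
# Sketch: keep both bundle coordinates on one pinned Hudi version (names illustrative)
HUDI_VERSION=1.0.0
SPARK_SERIES=3.5
export HUDI_PACKAGES="org.apache.hudi:hudi-utilities-slim-bundle_2.12:${HUDI_VERSION},org.apache.hudi:hudi-spark${SPARK_SERIES}-bundle_2.12:${HUDI_VERSION}"
# then: spark-submit --packages "$HUDI_PACKAGES" --class org.apache.hudi.utilities.HoodieCompactor ...
```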
diff --git a/website/docs/concurrency_control.md b/website/docs/concurrency_control.md
index bba21b45bc8..e14bd1c8206 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -245,14 +245,16 @@ hoodie.cleaner.policy.failed.writes=LAZY
### Multi Writing via Hudi Streamer
-The `HoodieStreamer` utility (part of hudi-utilities-bundle) provides ways to ingest from different sources such as DFS or Kafka, with the following capabilities.
+The `HoodieStreamer` utility (part of hudi-utilities-slim-bundle) provides ways to ingest from different sources such as DFS or Kafka, with the following capabilities.
Using optimistic_concurrency_control via Hudi Streamer requires adding the above configs to the properties file that can be passed to the job. For example below, adding the configs to kafka-source.properties file and passing them to Hudi Streamer will enable optimistic concurrency.
A Hudi Streamer job can then be triggered as follows:
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
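
The "above configs" referenced in the hunk are the lock-provider settings shown earlier on the page (the hunk header quotes `hoodie.cleaner.policy.failed.writes=LAZY`). A sketch of how they might sit in `kafka-source.properties`, with the ZooKeeper endpoint and paths purely illustrative:

```shell
# Sketch: OCC settings appended to kafka-source.properties (endpoint/paths illustrative)
cat >> kafka-source.properties <<'EOF'
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk1.example.com
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.base_path=/hudi/locks
EOF
```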
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 7785f4ceaca..3e572867e79 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -29,20 +29,16 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
[Hudi Streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) is the standalone utility to incrementally pull upstream changes
from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in two modes.
-To use Hudi Streamer in Spark, the `hudi-utilities-bundle` is required, by adding
-`--packages org.apache.hudi:hudi-utilities-bundle_2.11:0.13.0` to the `spark-submit` command. From 0.11.0 release, we start
-to provide a new `hudi-utilities-slim-bundle` which aims to exclude dependencies that can cause conflicts and compatibility
-issues with different versions of Spark. The `hudi-utilities-slim-bundle` should be used along with a Hudi Spark bundle
-corresponding to the Spark version used, e.g.,
-`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.13.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.13.0`,
-if using `hudi-utilities-bundle` solely in Spark encounters compatibility issues.
+To use Hudi Streamer in Spark, the `hudi-utilities-slim-bundle` and Hudi Spark bundle are required, by adding
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0` to the `spark-submit` command.
- **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.13.0 \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
@@ -90,7 +86,8 @@ Here is an example invocation for reading from kafka topic in a single-run mode
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.13.0 \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
diff --git a/website/docs/gcp_bigquery.md b/website/docs/gcp_bigquery.md
index 9f7b12dbeb3..59f6e678f62 100644
--- a/website/docs/gcp_bigquery.md
+++ b/website/docs/gcp_bigquery.md
@@ -65,9 +65,9 @@ Below shows an example for running `BigQuerySyncTool` with `HoodieStreamer`.
```shell
spark-submit --master yarn \
--packages com.google.cloud:google-cloud-bigquery:2.10.4 \
---jars /opt/hudi-gcp-bundle-0.13.0.jar \
+--jars "/opt/hudi-gcp-bundle-1.0.0.jar,/opt/hudi-utilities-slim-bundle_2.12-1.0.0.jar,/opt/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
-/opt/hudi-utilities-bundle_2.12-0.13.0.jar \
+/opt/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--target-base-path gs://my-hoodie-table/path \
--target-table mytable \
--table-type COPY_ON_WRITE \
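
The hunk is truncated before the flags that actually wire in the sync tool. A sketch of that tail — to append to the spark-submit above — using the `hoodie.gcp.bigquery.sync.*` config namespace from the same page, with illustrative values:

```shell
# Sketch: continuation of the spark-submit above (config values illustrative)
  --enable-sync \
  --sync-tool-classes org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
  --hoodie-conf hoodie.gcp.bigquery.sync.project_id=my-gcp-project \
  --hoodie-conf hoodie.gcp.bigquery.sync.dataset_name=my_dataset \
  --hoodie-conf hoodie.gcp.bigquery.sync.dataset_location=us-west1
```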
diff --git a/website/docs/hoodie_streaming_ingestion.md b/website/docs/hoodie_streaming_ingestion.md
index dca65b9b426..60586cbfc46 100644
--- a/website/docs/hoodie_streaming_ingestion.md
+++ b/website/docs/hoodie_streaming_ingestion.md
@@ -40,7 +40,9 @@ Expand this to see HoodieStreamer's "--help" output describing its capabilities
</summary>
```shell
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--allow-commit-on-no-checkpoint-change
@@ -254,7 +256,9 @@ For e.g: once you have Confluent Kafka, Schema registry up & running, produce so
and then ingest it as follows.
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
@@ -266,16 +270,11 @@ and then ingest it as follows.
In some cases, you may want to migrate your existing table into Hudi beforehand. Please refer to [migration guide](/docs/migration_guide).
-### Using `hudi-utilities` bundle jars
+### Using the `hudi-utilities-slim-bundle` jar
-From 0.11.0 release, we start to provide a new `hudi-utilities-slim-bundle` which aims to exclude dependencies that can
-cause conflicts and compatibility issues with different versions of Spark.
-
-It is recommended to switch to `hudi-utilities-slim-bundle`, which should be used along with a Hudi Spark bundle
+It is recommended to use `hudi-utilities-slim-bundle` along with a Hudi Spark bundle
corresponding to the Spark version used to make utilities work with Spark, e.g.,
-`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.13.0,org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0`.
-
-`hudi-utilities-bundle` remains as a legacy bundle jar to work with Spark 2.4 and 3.1.
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0`.
### Concurrency Control
@@ -292,7 +291,9 @@ As an example, adding the configs to `kafka-source.properties` file and passing
A Hudi Streamer job can then be triggered as follows:
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
@@ -621,7 +622,9 @@ under `hudi-utilities/src/test/resources/streamer-config`. The command to run `H
to how you run `HoodieStreamer`.
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieMultiTableStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieMultiTableStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--config-folder file://tmp/hudi-ingestion-config \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
diff --git a/website/docs/metadata_indexing.md b/website/docs/metadata_indexing.md
index d1978c1e486..560d51cafed 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -159,7 +159,8 @@ hoodie.write.lock.zookeeper.base_path=<zk_base_path>
```bash
spark-submit \
---class org.apache.hudi.utilities.streamer.HoodieStreamer `ls /Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar` \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
+--class org.apache.hudi.utilities.streamer.HoodieStreamer `ls /Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar` \
--props `ls /Users/home/path/to/write/config.properties` \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-ordering-field tpep_dropoff_datetime \
@@ -211,8 +212,9 @@ Now, we can schedule indexing using `HoodieIndexer` in `schedule` mode as follow
```
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieIndexer \
-/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
+/Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode schedule \
--base-path /tmp/hudi-ny-taxi \
@@ -230,8 +232,9 @@ To execute indexing, run the indexer in `execute` mode as below.
```
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieIndexer \
-/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
+/Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode execute \
--base-path /tmp/hudi-ny-taxi \
@@ -285,8 +288,9 @@ To drop an index, just run the indexer in `dropindex` mode.
```
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieIndexer \
-/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
+/Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode dropindex \
--base-path /tmp/hudi-ny-taxi \
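
The `indexer.properties` file used throughout these examples is user-authored. A sketch of plausible contents for asynchronously building a column-stats index — the keys are standard metadata-index configs, the column list is illustrative:

```shell
# Sketch: indexer.properties for async column-stats indexing (values illustrative)
cat > /Users/home/path/to/indexer.properties <<'EOF'
hoodie.metadata.index.async=true
hoodie.metadata.index.column.stats.enable=true
hoodie.metadata.index.column.stats.column.list=tpep_dropoff_datetime
EOF
```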
diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md
index 5b9d6bfe55d..c8839a2005f 100644
--- a/website/docs/migration_guide.md
+++ b/website/docs/migration_guide.md
@@ -54,8 +54,9 @@ mode to selective partitions based on the regex pattern [hoodie.bootstrap.mode.s
```
spark-submit --master local \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--run-bootstrap \
--target-base-path /tmp/hoodie/bootstrap_table \
--target-table bootstrap_table \
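
The hunk header above references the `hoodie.bootstrap.mode.selector` regex options, which arrive as `--hoodie-conf` flags further down the (truncated) command. A sketch of that tail — to append to the spark-submit above — with illustrative values on standard bootstrap config keys:

```shell
# Sketch: continuation of the bootstrap run above (values illustrative)
  --hoodie-conf hoodie.bootstrap.base.path=/path/to/existing/parquet/table \
  --hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
  --hoodie-conf hoodie.bootstrap.mode.selector.regex='2024/.*' \
  --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY
```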
diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md
index 83a03a4a112..d96d3b3875b 100644
--- a/website/docs/querying_data.md
+++ b/website/docs/querying_data.md
@@ -25,7 +25,7 @@ See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark
If your Spark environment does not have the Hudi jars installed, add [hudi-spark-bundle](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark-bundle) jar to the
classpath of drivers and executors using `--jars` option. Alternatively, hudi-spark-bundle can also be fetched via the
---packages options (e.g: --packages org.apache.hudi:hudi-spark-bundle_2.11:0.13.0).
+--packages options (e.g: --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0).
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
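
A minimal snapshot read, sketched with the spark-sql launch settings from the Quick Start (the table name is illustrative and assumed to be registered in the catalog):

```shell
# Sketch: snapshot query via spark-sql (table name illustrative)
spark-sql --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
  -e 'SELECT _hoodie_commit_time, _hoodie_record_key FROM hudi_table LIMIT 10'
```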
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 4ddb4005df3..a9315c34e3b 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -12,7 +12,7 @@ we will walk through code snippets that allows you to insert, update, delete and
## Setup
-Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [here](https://spark.apache.org/downloads) for setting up Spark.
+Hudi works with Spark 3.3 and above. You can follow instructions [here](https://spark.apache.org/downloads) for setting up Spark.
### Spark 3 Support Matrix
@@ -56,29 +56,14 @@ From the extracted directory run spark-shell with Hudi:
```shell
-# For Spark versions: 3.2 - 3.5
-export SPARK_VERSION=3.5 # or 3.4, 3.3, 3.2
+# For Spark versions: 3.3 - 3.5
+export SPARK_VERSION=3.5 # or 3.4, 3.3
spark-shell --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```
-```shell
-# For Spark versions: 3.0 - 3.1
-export SPARK_VERSION=3.1 # or 3.0
-spark-shell --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-```shell
-# For Spark version: 2.4
-spark-shell --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
</TabItem>
<TabItem value="python">
@@ -86,22 +71,11 @@ spark-shell --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 \
From the extracted directory run pyspark with Hudi:
```shell
-# For Spark versions: 3.2 - 3.5
+# For Spark versions: 3.3 - 3.5
export PYSPARK_PYTHON=$(which python3)
-export SPARK_VERSION=3.5 # or 3.4, 3.3, 3.2
+export SPARK_VERSION=3.5 # or 3.4, 3.3
pyspark --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```
-```shell
-# For Spark versions: 3.0 - 3.1
-export PYSPARK_PYTHON=$(which python3)
-export SPARK_VERSION=3.1 # or 3.0
-pyspark --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-```shell
-# For Spark version: 2.4
-export PYSPARK_PYTHON=$(which python3)
-pyspark --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
</TabItem>
<TabItem value="sparksql">
@@ -110,30 +84,14 @@ Hudi support using Spark SQL to write and read data with the **HoodieSparkSessio
From the extracted directory run Spark SQL with Hudi:
```shell
-# For Spark versions: 3.2 - 3.5
-export SPARK_VERSION=3.5 # or 3.4, 3.3, 3.2
+# For Spark versions: 3.3 - 3.5
+export SPARK_VERSION=3.5 # or 3.4, 3.3
spark-sql --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```
-```shell
-# For Spark versions: 3.0 - 3.1
-export SPARK_VERSION=3.1 # or 3.0
-spark-sql --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-```shell
-# For Spark version: 2.4
-spark-sql --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-
</TabItem>
</Tabs>
@@ -145,14 +103,6 @@ Users are recommended to set this config to reduce Kryo serialization overhead
```
:::
-:::note for Spark 3.2 and higher versions
-Use scala 2.12 builds with an additional config:
-
-```
---conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
-```
-:::
-
### Setup project
Below, we do imports and setup the table name and corresponding base path.
@@ -907,9 +857,6 @@ SELECT * FROM hudi_table TIMESTAMP AS OF '20220307091628793' WHERE id = 1;
SELECT * FROM hudi_table TIMESTAMP AS OF '2022-03-07 09:16:28.100' WHERE id = 1;
SELECT * FROM hudi_table TIMESTAMP AS OF '2022-03-08' WHERE id = 1;
```
-:::note
-Requires Spark 3.2+
-:::
</TabItem>
diff --git a/website/docs/snapshot_exporter.md b/website/docs/snapshot_exporter.md
index 07986d0bb8b..59544734bf6 100644
--- a/website/docs/snapshot_exporter.md
+++ b/website/docs/snapshot_exporter.md
@@ -31,10 +31,10 @@ query, perform any repartitioning if required and will write the data as Hudi, p
Exporter scans the source dataset and then makes a copy of it to the target output path.
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/hudi/" \
--output-format "hudi"
@@ -45,10 +45,10 @@ The Exporter can also convert the source dataset into other formats. Currently o
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" # or "parquet"
@@ -60,10 +60,10 @@ implementation of `org.apache.hudi.utilities.transform.Transformer` via `--trans
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
@@ -80,10 +80,10 @@ By default, if no partitioning parameters are given, the output dataset will hav
Example:
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
@@ -125,10 +125,10 @@ After putting this class in `my-custom.jar`, which is then placed on the job cla
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar,my-custom.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar,my-custom.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
diff --git a/website/docs/syncing_datahub.md b/website/docs/syncing_datahub.md
--- a/website/docs/syncing_datahub.md
+++ b/website/docs/syncing_datahub.md
@@ -31,14 +31,15 @@ the URN creation.
The following shows an example configuration to run `HoodieStreamer` with `DataHubSyncTool`.
-In addition to `hudi-utilities-bundle` that contains `HoodieStreamer`, you also add `hudi-datahub-sync-bundle` to
+In addition to `hudi-utilities-slim-bundle` that contains `HoodieStreamer`, you also add `hudi-datahub-sync-bundle` to
the classpath.
```shell
spark-submit --master yarn \
---jars /opt/hudi-datahub-sync-bundle-0.13.0.jar \
+--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+--jars /opt/hudi-datahub-sync-bundle-1.0.0.jar \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
-/opt/hudi-utilities-bundle_2.12-0.13.0.jar \
+/opt/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--target-table mytable \
# ... other HoodieStreamer's configs
--enable-sync \
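
`DataHubSyncTool` also needs to know where to emit metadata; those configs fall beyond the truncated tail of the hunk. A sketch of that continuation, where the sync-tool class name and the `hoodie.meta.sync.datahub.*` config key are assumed from current Hudi releases and the endpoint is illustrative:

```shell
# Sketch: continuation of the spark-submit above (class/key assumed; endpoint illustrative)
  --sync-tool-classes org.apache.hudi.sync.datahub.DataHubSyncTool \
  --hoodie-conf hoodie.meta.sync.datahub.emitter.server=http://datahub-gms.example.com:8080
```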