This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new b148a8dbbbb [HUDI-8296] Improve docs around Hudi Spark support and hudi-utilities-slim-bundle (#12478)
b148a8dbbbb is described below
commit b148a8dbbbba7bea8ae3f44dec0d5c07c898add3
Author: Y Ethan Guo <[email protected]>
AuthorDate: Thu Dec 12 17:45:24 2024 -0800
[HUDI-8296] Improve docs around Hudi Spark support and hudi-utilities-slim-bundle (#12478)

* [HUDI-8296] Improve docs around Hudi Spark support and hudi-utilities-slim-bundle
* Fix one word
---
website/docs/cleaning.md | 17 +++++---
website/docs/cli.md | 3 +-
website/docs/clustering.md | 6 ++-
website/docs/compaction.md | 4 +-
website/docs/concurrency_control.md | 6 ++-
website/docs/deployment.md | 15 +++----
website/docs/gcp_bigquery.md | 4 +-
website/docs/hoodie_streaming_ingestion.md | 27 ++++++------
website/docs/metadata_indexing.md | 12 ++++--
website/docs/migration_guide.md | 3 +-
website/docs/querying_data.md | 2 +-
website/docs/quick-start-guide.md | 67 ++++--------------------------
website/docs/snapshot_exporter.md | 20 ++++-----
website/docs/syncing_datahub.md | 7 ++--
14 files changed, 78 insertions(+), 115 deletions(-)
diff --git a/website/docs/cleaning.md b/website/docs/cleaning.md
index c050604c6e9..5f6ea4b3697 100644
--- a/website/docs/cleaning.md
+++ b/website/docs/cleaning.md
@@ -79,7 +79,9 @@ For Flink based writing, this is the default mode of cleaning. Please refer to [
#### Run independently
Hoodie Cleaner can also be run as a separate process. Following is the command for running the cleaner independently:
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--help, -h
@@ -101,7 +103,9 @@ spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls
Some examples to run the cleaner.
Keep the latest 10 commits
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--target-base-path /path/to/hoodie_table \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
--hoodie-conf hoodie.cleaner.commits.retained=10 \
@@ -109,15 +113,19 @@ spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls
```
Keep the latest 3 file versions
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
-  --target-base-path /path/to/hoodie_table \
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
+  --target-base-path /path/to/hoodie_table \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
--hoodie-conf hoodie.cleaner.fileversions.retained=3 \
--hoodie-conf hoodie.cleaner.parallelism=200
```
Clean commits older than 24 hours
```
-spark-submit --master local --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar`\
+spark-submit --master local \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.HoodieCleaner `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--target-base-path /path/to/hoodie_table \
--hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_BY_HOURS \
--hoodie-conf hoodie.cleaner.hours.retained=24 \
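
As a side note for readers trying these cleaner commands: when the bundles are already built locally (e.g., via `mvn package`), the same job can be launched from the build outputs with `--jars` instead of resolving `--packages` from Maven. A minimal sketch, assuming jar paths from a local 1.0.0 build:

```shell
# Sketch: HoodieCleaner from locally built bundles (jar paths assume a local 1.0.0 build)
spark-submit --master local \
  --jars packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar \
  --class org.apache.hudi.utilities.HoodieCleaner \
  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
  --target-base-path /path/to/hoodie_table \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.cleaner.commits.retained=10
```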
diff --git a/website/docs/cli.md b/website/docs/cli.md
index 7cc4cdd92b0..def32b11a8e 100644
--- a/website/docs/cli.md
+++ b/website/docs/cli.md
@@ -8,8 +8,7 @@ last_modified_at: 2021-08-18T15:59:57-04:00
Once hudi has been built, the shell can be fired via `cd hudi-cli && ./hudi-cli.sh`.
### Hudi CLI Bundle setup
-In release `0.13.0` we have now added another way of launching the `hudi cli`, which is using the `hudi-cli-bundle`. (Note this is only supported for Spark3,
-for Spark2 please see the above Local setup section)
+In release `0.13.0` we have now added another way of launching the `hudi cli`, which is using the `hudi-cli-bundle`.
There are a couple of requirements when using this approach such as having `spark` installed locally on your machine.
It is required to use a spark distribution with hadoop dependencies packaged such as `spark-3.3.1-bin-hadoop2.tgz` from https://archive.apache.org/dist/spark/.
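
For orientation, launching via the CLI bundle generally means pointing a local Spark at the CLI and Spark bundle jars. A sketch under stated assumptions — the script and environment-variable names below are illustrative, not taken from this commit, so consult the cli.md page for the exact steps:

```shell
# Sketch of a hudi-cli-bundle launch; script and variable names are illustrative
export SPARK_HOME=/path/to/spark-3.3.1-bin-hadoop2    # Spark distro with Hadoop deps, per the doc
export CLI_BUNDLE_JAR=packaging/hudi-cli-bundle/target/hudi-cli-bundle_2.12-1.0.0.jar
export SPARK_BUNDLE_JAR=packaging/hudi-spark-bundle/target/hudi-spark3.3-bundle_2.12-1.0.0.jar
cd packaging/hudi-cli-bundle && ./hudi-cli-with-bundle.sh
```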
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 80a8717e177..0bbbad9781a 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -243,8 +243,9 @@ A sample spark-submit command to setup HoodieClusteringJob is as below:
```bash
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieClusteringJob \
-/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
+/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /path/to/config/clusteringjob.properties \
--mode scheduleAndExecute \
--base-path /path/to/hudi_table/basePath \
@@ -272,8 +273,9 @@ A sample spark-submit command to setup HoodieStreamer is as below:
```bash
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
-/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
+/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /path/to/config/clustering_kafka.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
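
The `clusteringjob.properties` file passed via `--props` above is user-authored. A minimal sketch of plausible contents, using standard Hudi clustering configs (the values are illustrative):

```shell
# Sketch: a minimal clusteringjob.properties (keys are standard clustering configs; values illustrative)
cat > /path/to/config/clusteringjob.properties <<'EOF'
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.plan.strategy.sort.columns=column1,column2
EOF
```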
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index de5bd20a0e1..7859030052a 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -150,7 +150,7 @@ ingests data to Hudi table continuously from upstream sources. In this mode, Hud
compactions. Here is an example snippet for running in continuous mode with async compactions
```properties
-spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+spark-submit --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
--table-type MERGE_ON_READ \
--target-base-path <hudi_base_path> \
@@ -187,7 +187,7 @@ The compactor utility allows to do scheduling and execution of compaction.
Example:
```properties
-spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
+spark-submit --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--class org.apache.hudi.utilities.HoodieCompactor \
--base-path <base_path> \
--table-name <table_name> \
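
Note that each rewritten command pins the utilities-slim bundle and the Spark bundle to the same Hudi release. One way to keep the pair in lockstep across jobs is to derive both coordinates from a single pinned version — a sketch, with illustrative variable names:

```shell
# Sketch: keep both bundle coordinates on one pinned Hudi version (names illustrative)
HUDI_VERSION=1.0.0
SPARK_SERIES=3.5
export HUDI_PACKAGES="org.apache.hudi:hudi-utilities-slim-bundle_2.12:${HUDI_VERSION},org.apache.hudi:hudi-spark${SPARK_SERIES}-bundle_2.12:${HUDI_VERSION}"
# then: spark-submit --packages "$HUDI_PACKAGES" --class org.apache.hudi.utilities.HoodieCompactor ...
```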
diff --git a/website/docs/concurrency_control.md b/website/docs/concurrency_control.md
index bba21b45bc8..e14bd1c8206 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -245,14 +245,16 @@ hoodie.cleaner.policy.failed.writes=LAZY
### Multi Writing via Hudi Streamer
-The `HoodieStreamer` utility (part of hudi-utilities-bundle) provides ways to ingest from different sources such as DFS or Kafka, with the following capabilities.
+The `HoodieStreamer` utility (part of hudi-utilities-slim-bundle) provides ways to ingest from different sources such as DFS or Kafka, with the following capabilities.
Using optimistic_concurrency_control via Hudi Streamer requires adding the above configs to the properties file that can be passed to the job. For example below, adding the configs to kafka-source.properties file and passing them to Hudi Streamer will enable optimistic concurrency.
A Hudi Streamer job can then be triggered as follows:
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
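
The "above configs" referenced in the hunk are the lock-provider settings shown earlier on the page (the hunk header quotes `hoodie.cleaner.policy.failed.writes=LAZY`). A sketch of how they might sit in `kafka-source.properties`, with the ZooKeeper endpoint and paths purely illustrative:

```shell
# Sketch: OCC settings appended to kafka-source.properties (endpoint/paths illustrative)
cat >> kafka-source.properties <<'EOF'
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk1.example.com
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.base_path=/hudi/locks
EOF
```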
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 7785f4ceaca..3e572867e79 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -29,20 +29,16 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
[Hudi Streamer](/docs/hoodie_streaming_ingestion#hudi-streamer) is the standalone utility to incrementally pull upstream changes
from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in two modes.
-To use Hudi Streamer in Spark, the `hudi-utilities-bundle` is required, by adding
-`--packages org.apache.hudi:hudi-utilities-bundle_2.11:0.13.0` to the `spark-submit` command. From 0.11.0 release, we start
-to provide a new `hudi-utilities-slim-bundle` which aims to exclude dependencies that can cause conflicts and compatibility
-issues with different versions of Spark. The `hudi-utilities-slim-bundle` should be used along with a Hudi Spark bundle
-corresponding to the Spark version used, e.g.,
-`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.13.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.13.0`,
-if using `hudi-utilities-bundle` solely in Spark encounters compatibility issues.
+To use Hudi Streamer in Spark, the `hudi-utilities-slim-bundle` and Hudi Spark bundle are required, by adding
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0` to the `spark-submit` command.
- **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.13.0 \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
@@ -90,7 +86,8 @@ Here is an example invocation for reading from kafka topic in a single-run mode
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.13.0 \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
diff --git a/website/docs/gcp_bigquery.md b/website/docs/gcp_bigquery.md
index 9f7b12dbeb3..59f6e678f62 100644
--- a/website/docs/gcp_bigquery.md
+++ b/website/docs/gcp_bigquery.md
@@ -65,9 +65,9 @@ Below shows an example for running `BigQuerySyncTool` with `HoodieStreamer`.
```shell
spark-submit --master yarn \
--packages com.google.cloud:google-cloud-bigquery:2.10.4 \
---jars /opt/hudi-gcp-bundle-0.13.0.jar \
+--jars "/opt/hudi-gcp-bundle-1.0.0.jar,/opt/hudi-utilities-slim-bundle_2.12-1.0.0.jar,/opt/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
-/opt/hudi-utilities-bundle_2.12-0.13.0.jar \
+/opt/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--target-base-path gs://my-hoodie-table/path \
--target-table mytable \
--table-type COPY_ON_WRITE \
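
The hunk is truncated before the flags that actually wire in the sync tool. A sketch of that tail — to append to the spark-submit above — using the `hoodie.gcp.bigquery.sync.*` config namespace from the same page, with illustrative values:

```shell
# Sketch: continuation of the spark-submit above (config values illustrative)
  --enable-sync \
  --sync-tool-classes org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
  --hoodie-conf hoodie.gcp.bigquery.sync.project_id=my-gcp-project \
  --hoodie-conf hoodie.gcp.bigquery.sync.dataset_name=my_dataset \
  --hoodie-conf hoodie.gcp.bigquery.sync.dataset_location=us-west1
```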
diff --git a/website/docs/hoodie_streaming_ingestion.md b/website/docs/hoodie_streaming_ingestion.md
index dca65b9b426..60586cbfc46 100644
--- a/website/docs/hoodie_streaming_ingestion.md
+++ b/website/docs/hoodie_streaming_ingestion.md
@@ -40,7 +40,9 @@ Expand this to see HoodieStreamer's "--help" output describing its capabilities
</summary>
```shell
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--allow-commit-on-no-checkpoint-change
@@ -254,7 +256,9 @@ For e.g: once you have Confluent Kafka, Schema registry up & running, produce so
and then ingest it as follows.
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
@@ -266,16 +270,11 @@ and then ingest it as follows.
In some cases, you may want to migrate your existing table into Hudi beforehand. Please refer to [migration guide](/docs/migration_guide).
-### Using `hudi-utilities` bundle jars
+### Using the `hudi-utilities-slim-bundle` jar
-From 0.11.0 release, we start to provide a new `hudi-utilities-slim-bundle` which aims to exclude dependencies that can
-cause conflicts and compatibility issues with different versions of Spark.
-
-It is recommended to switch to `hudi-utilities-slim-bundle`, which should be used along with a Hudi Spark bundle
+It is recommended to use `hudi-utilities-slim-bundle` along with a Hudi Spark bundle
corresponding to the Spark version used to make utilities work with Spark, e.g.,
-`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.13.0,org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0`.
-
-`hudi-utilities-bundle` remains as a legacy bundle jar to work with Spark 2.4 and 3.1.
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0`.
### Concurrency Control
@@ -292,7 +291,9 @@ As an example, adding the configs to `kafka-source.properties` file and passing
A Hudi Streamer job can then be triggered as follows:
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
@@ -621,7 +622,9 @@ under `hudi-utilities/src/test/resources/streamer-config`. The command to run `H
to how you run `HoodieStreamer`.
```java
-[hoodie]$ spark-submit --class org.apache.hudi.utilities.streamer.HoodieMultiTableStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+[hoodie]$ spark-submit \
+  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+  --class org.apache.hudi.utilities.streamer.HoodieMultiTableStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties \
--config-folder file://tmp/hudi-ingestion-config \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
diff --git a/website/docs/metadata_indexing.md b/website/docs/metadata_indexing.md
index d1978c1e486..560d51cafed 100644
--- a/website/docs/metadata_indexing.md
+++ b/website/docs/metadata_indexing.md
@@ -159,7 +159,8 @@ hoodie.write.lock.zookeeper.base_path=<zk_base_path>
```bash
spark-submit \
---class org.apache.hudi.utilities.streamer.HoodieStreamer `ls /Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar` \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
+--class org.apache.hudi.utilities.streamer.HoodieStreamer `ls /Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar` \
--props `ls /Users/home/path/to/write/config.properties` \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--source-ordering-field tpep_dropoff_datetime \
@@ -211,8 +212,9 @@ Now, we can schedule indexing using `HoodieIndexer` in `schedule` mode as follow
```
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieIndexer \
-/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
+/Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode schedule \
--base-path /tmp/hudi-ny-taxi \
@@ -230,8 +232,9 @@ To execute indexing, run the indexer in `execute` mode as below.
```
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieIndexer \
-/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
+/Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode execute \
--base-path /tmp/hudi-ny-taxi \
@@ -285,8 +288,9 @@ To drop an index, just run the indexer in `dropindex` mode.
```
spark-submit \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--class org.apache.hudi.utilities.HoodieIndexer \
-/Users/home/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.13.0.jar \
+/Users/home/path/to/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--props /Users/home/path/to/indexer.properties \
--mode dropindex \
--base-path /tmp/hudi-ny-taxi \
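
The `indexer.properties` file used throughout these examples is user-authored. A sketch of plausible contents for asynchronously building a column-stats index — the keys are standard metadata-index configs, the column list is illustrative:

```shell
# Sketch: indexer.properties for async column-stats indexing (values illustrative)
cat > /Users/home/path/to/indexer.properties <<'EOF'
hoodie.metadata.index.async=true
hoodie.metadata.index.column.stats.enable=true
hoodie.metadata.index.column.stats.column.list=tpep_dropoff_datetime
EOF
```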
diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md
index 5b9d6bfe55d..c8839a2005f 100644
--- a/website/docs/migration_guide.md
+++ b/website/docs/migration_guide.md
@@ -54,8 +54,9 @@ mode to selective partitions based on the regex pattern [hoodie.bootstrap.mode.s
```
spark-submit --master local \
+--jars "packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--class org.apache.hudi.utilities.streamer.HoodieStreamer `ls packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
--run-bootstrap \
--target-base-path /tmp/hoodie/bootstrap_table \
--target-table bootstrap_table \
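
The hunk header above references the `hoodie.bootstrap.mode.selector` regex options, which arrive as `--hoodie-conf` flags further down the (truncated) command. A sketch of that tail — to append to the spark-submit above — with illustrative values on standard bootstrap config keys:

```shell
# Sketch: continuation of the bootstrap run above (values illustrative)
  --hoodie-conf hoodie.bootstrap.base.path=/path/to/existing/parquet/table \
  --hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
  --hoodie-conf hoodie.bootstrap.mode.selector.regex='2024/.*' \
  --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY
```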
diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md
index 83a03a4a112..d96d3b3875b 100644
--- a/website/docs/querying_data.md
+++ b/website/docs/querying_data.md
@@ -25,7 +25,7 @@ See the [Spark Quick Start](/docs/quick-start-guide) for more examples of Spark
If your Spark environment does not have the Hudi jars installed, add [hudi-spark-bundle](https://mvnrepository.com/artifact/org.apache.hudi/hudi-spark-bundle) jar to the
classpath of drivers and executors using `--jars` option. Alternatively, hudi-spark-bundle can also be fetched via the
---packages options (e.g: --packages org.apache.hudi:hudi-spark-bundle_2.11:0.13.0).
+--packages options (e.g: --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0).
### Snapshot query {#spark-snap-query}
Retrieve the data table at the present point in time.
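
A minimal snapshot read, sketched with the spark-sql launch settings from the Quick Start (the table name is illustrative and assumed to be registered in the catalog):

```shell
# Sketch: snapshot query via spark-sql (table name illustrative)
spark-sql --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
  -e 'SELECT _hoodie_commit_time, _hoodie_record_key FROM hudi_table LIMIT 10'
```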
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 4ddb4005df3..a9315c34e3b 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -12,7 +12,7 @@ we will walk through code snippets that allows you to insert, update, delete and
## Setup
-Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [here](https://spark.apache.org/downloads) for setting up Spark.
+Hudi works with Spark 3.3 and above. You can follow instructions [here](https://spark.apache.org/downloads) for setting up Spark.
### Spark 3 Support Matrix
@@ -56,29 +56,14 @@ From the extracted directory run spark-shell with Hudi:
```shell
-# For Spark versions: 3.2 - 3.5
-export SPARK_VERSION=3.5 # or 3.4, 3.3, 3.2
+# For Spark versions: 3.3 - 3.5
+export SPARK_VERSION=3.5 # or 3.4, 3.3
spark-shell --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```
-```shell
-# For Spark versions: 3.0 - 3.1
-export SPARK_VERSION=3.1 # or 3.0
-spark-shell --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-```shell
-# For Spark version: 2.4
-spark-shell --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
</TabItem>
<TabItem value="python">
@@ -86,22 +71,11 @@ spark-shell --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 \
From the extracted directory run pyspark with Hudi:
```shell
-# For Spark versions: 3.2 - 3.5
+# For Spark versions: 3.3 - 3.5
export PYSPARK_PYTHON=$(which python3)
-export SPARK_VERSION=3.5 # or 3.4, 3.3, 3.2
+export SPARK_VERSION=3.5 # or 3.4, 3.3
pyspark --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```
-```shell
-# For Spark versions: 3.0 - 3.1
-export PYSPARK_PYTHON=$(which python3)
-export SPARK_VERSION=3.1 # or 3.0
-pyspark --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-```shell
-# For Spark version: 2.4
-export PYSPARK_PYTHON=$(which python3)
-pyspark --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
</TabItem>
<TabItem value="sparksql">
@@ -110,30 +84,14 @@ Hudi support using Spark SQL to write and read data with the **HoodieSparkSessio
From the extracted directory run Spark SQL with Hudi:
```shell
-# For Spark versions: 3.2 - 3.5
-export SPARK_VERSION=3.5 # or 3.4, 3.3, 3.2
+# For Spark versions: 3.3 - 3.5
+export SPARK_VERSION=3.5 # or 3.4, 3.3
spark-sql --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
```
-```shell
-# For Spark versions: 3.0 - 3.1
-export SPARK_VERSION=3.1 # or 3.0
-spark-sql --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-```shell
-# For Spark version: 2.4
-spark-sql --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.15.0 \
---conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
---conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
---conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
-```
-
</TabItem>
</Tabs>
@@ -145,14 +103,6 @@ Users are recommended to set this config to reduce Kryo serialization overhead
```
:::
-:::note for Spark 3.2 and higher versions
-Use scala 2.12 builds with an additional config:
-
-```
---conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
-```
-:::
-
### Setup project
Below, we do imports and setup the table name and corresponding base path.
@@ -907,9 +857,6 @@ SELECT * FROM hudi_table TIMESTAMP AS OF '20220307091628793' WHERE id = 1;
SELECT * FROM hudi_table TIMESTAMP AS OF '2022-03-07 09:16:28.100' WHERE id = 1;
SELECT * FROM hudi_table TIMESTAMP AS OF '2022-03-08' WHERE id = 1;
```
-:::note
-Requires Spark 3.2+
-:::
</TabItem>
diff --git a/website/docs/snapshot_exporter.md b/website/docs/snapshot_exporter.md
index 07986d0bb8b..59544734bf6 100644
--- a/website/docs/snapshot_exporter.md
+++ b/website/docs/snapshot_exporter.md
@@ -31,10 +31,10 @@ query, perform any repartitioning if required and will write the data as Hudi, p
Exporter scans the source dataset and then makes a copy of it to the target output path.
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/hudi/" \
--output-format "hudi"
@@ -45,10 +45,10 @@ The Exporter can also convert the source dataset into other formats. Currently o
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" # or "parquet"
@@ -60,10 +60,10 @@ implementation of `org.apache.hudi.utilities.transform.Transformer` via `--trans
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
@@ -80,10 +80,10 @@ By default, if no partitioning parameters are given, the output dataset will hav
Example:
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
@@ -125,10 +125,10 @@ After putting this class in `my-custom.jar`, which is then placed on the job cla
```bash
spark-submit \
-  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar,my-custom.jar" \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0.jar,my-custom.jar" \
  --deploy-mode "client" \
  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
-  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
diff --git a/website/docs/syncing_datahub.md b/website/docs/syncing_datahub.md
--- a/website/docs/syncing_datahub.md
+++ b/website/docs/syncing_datahub.md
@@ -31,14 +31,15 @@ the URN creation.
The following shows an example configuration to run `HoodieStreamer` with `DataHubSyncTool`.
-In addition to `hudi-utilities-bundle` that contains `HoodieStreamer`, you also add `hudi-datahub-sync-bundle` to
+In addition to `hudi-utilities-slim-bundle` that contains `HoodieStreamer`, you also add `hudi-datahub-sync-bundle` to
the classpath.
```shell
spark-submit --master yarn \
---jars /opt/hudi-datahub-sync-bundle-0.13.0.jar \
+--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.0,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0 \
+--jars /opt/hudi-datahub-sync-bundle-1.0.0.jar \
--class org.apache.hudi.utilities.streamer.HoodieStreamer \
-/opt/hudi-utilities-bundle_2.12-0.13.0.jar \
+/opt/hudi-utilities-slim-bundle_2.12-1.0.0.jar \
--target-table mytable \
# ... other HoodieStreamer's configs
--enable-sync \
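
`DataHubSyncTool` also needs to know where to emit metadata; those configs fall beyond the truncated tail of the hunk. A sketch of that continuation, where the sync-tool class name and the `hoodie.meta.sync.datahub.*` config key are assumed from current Hudi releases and the endpoint is illustrative:

```shell
# Sketch: continuation of the spark-submit above (class/key assumed; endpoint illustrative)
  --sync-tool-classes org.apache.hudi.sync.datahub.DataHubSyncTool \
  --hoodie-conf hoodie.meta.sync.datahub.emitter.server=http://datahub-gms.example.com:8080
```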