This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 85c72daf0b [HUDI-3680][HUDI-3926] Update docs for Spark, utilities, and utilities-slim bundles (#5454)
85c72daf0b is described below
commit 85c72daf0b3df88d3556f51c921bed0485495e05
Author: Y Ethan Guo <[email protected]>
AuthorDate: Fri Apr 29 02:42:27 2022 -0700
[HUDI-3680][HUDI-3926] Update docs for Spark, utilities, and utilities-slim
bundles (#5454)
---
website/docs/deployment.md | 15 +++++--
website/docs/docker_demo.md | 12 ++----
website/docs/hoodie_deltastreamer.md | 8 +++-
website/docs/quick-start-guide.md | 76 +++++++++++++++++++++---------------
website/docs/syncing_metastore.md | 1 -
5 files changed, 67 insertions(+), 45 deletions(-)
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 739480205d..a4a57fb6b0 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -25,14 +25,23 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### DeltaStreamer
-[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes
+from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in two modes.
+
+To use DeltaStreamer in Spark, the `hudi-utilities-bundle` is required; add
+`--packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0` to the `spark-submit` command. Starting from the
+0.11.0 release, we provide a new `hudi-utilities-slim-bundle`, which excludes dependencies that can cause conflicts
+and compatibility issues with different versions of Spark. If using `hudi-utilities-bundle` alone in Spark encounters
+compatibility issues, use `hudi-utilities-slim-bundle` together with the Hudi Spark bundle corresponding to the
+Spark version in use, e.g.,
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0`.
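As a sketch of the slim-bundle pairing described above (the class name is the standard DeltaStreamer entry point; all remaining arguments are elided and would follow the examples below), the `--packages` portion of such a `spark-submit` would look like:

```shell
# Sketch only: same invocation shape as the hudi-utilities-bundle examples
# below, but pairing the slim bundle with the Spark 3.1 bundle (Scala 2.12).
spark-submit \
  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  ...
```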
- **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
@@ -80,7 +89,7 @@ Here is an example invocation for reading from kafka topic in a single-run mode
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md
index eeccf117aa..26d41251bc 100644
--- a/website/docs/docker_demo.md
+++ b/website/docs/docker_demo.md
@@ -391,8 +391,7 @@ $SPARK_INSTALL/bin/spark-shell \
--deploy-mode client \
--driver-memory 1G \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
...
Welcome to
@@ -793,8 +792,7 @@ $SPARK_INSTALL/bin/spark-shell \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
# Copy On Write Table:
@@ -1050,8 +1048,7 @@ $SPARK_INSTALL/bin/spark-shell \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
Welcome to
____ __
@@ -1247,8 +1244,7 @@ $SPARK_INSTALL/bin/spark-shell \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
# Read Optimized Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index ae87c579cd..6f2c80d5cf 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -5,7 +5,7 @@ keywords: [hudi, deltastreamer, hoodiedeltastreamer]
## DeltaStreamer
-The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
+The `HoodieDeltaStreamer` utility (part of `hudi-utilities-bundle`) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
- Exactly once ingestion of new events from Kafka, [incremental imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports) from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
- Support json, avro or a custom record types for the incoming data
@@ -151,6 +151,12 @@ and then ingest it as follows.
In some cases, you may want to migrate your existing table into Hudi beforehand. Please refer to [migration guide](/docs/migration_guide).
+Starting from the 0.11.0 release, we provide a new `hudi-utilities-slim-bundle`, which excludes dependencies that
+can cause conflicts and compatibility issues with different versions of Spark. If using `hudi-utilities-bundle`
+alone to run `HoodieDeltaStreamer` in Spark encounters compatibility issues, use `hudi-utilities-slim-bundle`
+together with the Hudi Spark bundle corresponding to the Spark version used, e.g.,
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0`.
+
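A minimal sketch of the same pairing using locally built jars instead of `--packages` (the jar paths assume a local Hudi build and are illustrative, not part of this commit):

```shell
# Sketch: pass the slim utilities bundle as the application jar and the
# matching Spark bundle via --jars; remaining DeltaStreamer args elided.
spark-submit \
  --jars packaging/hudi-spark-bundle/target/hudi-spark3.1-bundle_2.12-0.11.0.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-0.11.0.jar \
  ...
```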
### MultiTableDeltaStreamer
`HoodieMultiTableDeltaStreamer`, a wrapper on top of `HoodieDeltaStreamer`, enables one to ingest multiple tables at a single go into hudi datasets. Currently it only supports sequential processing of tables to be ingested and COPY_ON_WRITE storage type. The command line options for `HoodieMultiTableDeltaStreamer` are pretty much similar to `HoodieDeltaStreamer` with the only exception that you are required to provide table wise configs in separate files in a dedicated config folder. The [...]
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 0841351f63..8f77a34dd7 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -20,7 +20,7 @@ Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [
| Hudi | Supported Spark 3 version |
|:----------------|:------------------------------|
-| 0.11.0 | 3.2.x (default build), 3.1.x |
+| 0.11.0 | 3.2.x (default build, Spark bundle only), 3.1.x |
| 0.10.0 | 3.1.x (default build), 3.0.x |
| 0.7.0 - 0.9.0 | 3.0.x |
| 0.6.0 and prior | not supported |
@@ -29,6 +29,16 @@ Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [
As of 0.9.0 release, Spark SQL DML support has been added and is experimental.
+In the 0.11.0 release, we add support for Spark 3.2.x and continue to support Spark 3.1.x and Spark 2.4.x. Spark
+3.0.x is no longer officially supported. To make it easier for users to pick the right Hudi Spark bundle for their
+deployment, we make the following adjustments to the bundle naming:
+
+- For each supported Spark minor version, there is a corresponding Hudi Spark bundle with the major and minor version
+in the naming, i.e., `hudi-spark3.2-bundle`, `hudi-spark3.1-bundle`, and `hudi-spark2.4-bundle`.
+- We encourage users to migrate to the new bundles above. We keep the bundles with the legacy naming in this release,
+i.e., `hudi-spark3-bundle`, targeting Spark 3.2.x, the latest Spark 3 version, and `hudi-spark-bundle` for
+Spark 2.4.x.
+
<Tabs
defaultValue="scala"
values={[
@@ -41,24 +51,25 @@ values={[
From the extracted directory run spark-shell with Hudi as:
```scala
-// spark-shell for spark 3.1
+// spark-shell for spark 3.2
spark-shell \
- --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 \
- --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+ --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
+ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+ --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
-// spark-shell for spark 3.2
+// spark-shell for spark 3.1
spark-shell \
- --packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3 \
+ --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-// spark-shell for spark 2 with scala 2.12
+// spark-shell for spark 2.4 with scala 2.12
spark-shell \
- --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4 \
+ --packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-// spark-shell for spark 2 with scala 2.11
+// spark-shell for spark 2.4 with scala 2.11
spark-shell \
- --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4 \
+ --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
@@ -69,24 +80,25 @@ Hudi support using Spark SQL to write and read data with the **HoodieSparkSessio
From the extracted directory run Spark SQL with Hudi as:
```shell
-# Spark SQL for spark 3.1
-spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 \
+# Spark SQL for spark 3.2
+spark-sql --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
-# Spark SQL for spark 3.0
-spark-sql --packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3 \
+# Spark SQL for spark 3.1
+spark-sql --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
-# Spark SQL for spark 2 with scala 2.11
-spark-sql --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4 \
+# Spark SQL for spark 2.4 with scala 2.11
+spark-sql --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
-# Spark SQL for spark 2 with scala 2.12
+# Spark SQL for spark 2.4 with scala 2.12
spark-sql \
- --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4 \
+ --packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```
@@ -100,24 +112,25 @@ From the extracted directory run pyspark with Hudi as:
# pyspark
export PYSPARK_PYTHON=$(which python3)
-# for spark3.1
+# for spark3.2
pyspark
---packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2
+--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
-# for spark3.0
+# for spark3.1
pyspark
---packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3
+--packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-# for spark2 with scala 2.12
+# for spark2.4 with scala 2.12
pyspark
---packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4
+--packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-# for spark2 with scala 2.11
+# for spark2.4 with scala 2.11
pyspark
---packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4
+--packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
@@ -126,10 +139,9 @@ pyspark
:::note Please note the following
<ul>
- <li>spark-avro module needs to be specified in --packages as it is not included with spark-shell by default</li>
- <li>spark-avro and spark versions must match (we have used 3.1.2 for both above)</li>
- <li>we have used hudi-spark-bundle built for scala 2.12 since the spark-avro module used also depends on 2.12. If spark-avro_2.11 is used, correspondingly hudi-spark-bundle_2.11 needs to be used. </li>
+ <li> For Spark 3.2, the additional spark_catalog config is required: --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' </li>
+ <li> We have used hudi-spark-bundle built for scala 2.12 since the spark-avro module used can also depend on 2.12. </li>
</ul>
:::
@@ -1175,8 +1187,8 @@ more details please refer to [procedures](procedures).
## Where to go from here?
You can also do the quickstart by [building hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source),
-and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
-instead of `--packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
+and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
+instead of `--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
for more info.
Also, we used Spark here to show case the capabilities of Hudi. However, Hudi can support multiple table types/query types and
diff --git a/website/docs/syncing_metastore.md b/website/docs/syncing_metastore.md
index f1c1fdc582..1b2baa0f24 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -181,7 +181,6 @@ Assuming the metastore is configured properly, then start the spark-shell.
```
$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE \
- --packages org.apache.spark:spark-avro_2.11:2.4.4 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```