This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 85c72daf0b [HUDI-3680][HUDI-3926] Update docs for Spark, utilities, and utilities-slim bundles (#5454)
85c72daf0b is described below
commit 85c72daf0b3df88d3556f51c921bed0485495e05
Author: Y Ethan Guo <[email protected]>
AuthorDate: Fri Apr 29 02:42:27 2022 -0700
[HUDI-3680][HUDI-3926] Update docs for Spark, utilities, and utilities-slim
bundles (#5454)
---
website/docs/deployment.md | 15 +++++--
website/docs/docker_demo.md | 12 ++----
website/docs/hoodie_deltastreamer.md | 8 +++-
website/docs/quick-start-guide.md | 76 +++++++++++++++++++++---------------
website/docs/syncing_metastore.md | 1 -
5 files changed, 67 insertions(+), 45 deletions(-)
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 739480205d..a4a57fb6b0 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -25,14 +25,23 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### DeltaStreamer
-[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in 2 modes.
+[DeltaStreamer](/docs/hoodie_deltastreamer#deltastreamer) is the standalone utility to incrementally pull upstream changes
+from varied sources such as DFS, Kafka and DB Changelogs and ingest them to hudi tables. It runs as a spark application in two modes.
+
+To use DeltaStreamer in Spark, the `hudi-utilities-bundle` is required; add
+`--packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0` to the `spark-submit` command. Starting from the
+0.11.0 release, we provide a new `hudi-utilities-slim-bundle`, which excludes dependencies that can cause conflicts
+and compatibility issues with different versions of Spark. If using `hudi-utilities-bundle` alone in Spark encounters
+compatibility issues, use `hudi-utilities-slim-bundle` together with the Hudi Spark bundle corresponding to the
+Spark version in use, e.g.,
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0`.
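As a sketch of the slim-bundle pairing described above (the class name is the standard DeltaStreamer entry point; all remaining arguments are elided and would follow the examples below), the `--packages` portion of such a `spark-submit` would look like:

```shell
# Sketch only: same invocation shape as the hudi-utilities-bundle examples
# below, but pairing the slim bundle with the Spark 3.1 bundle (Scala 2.12).
spark-submit \
  --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  ...
```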
- **Run Once Mode** : In this mode, Deltastreamer performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Background operations like cleaning old file versions and archiving hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, Compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, Compaction is run inline for eve [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
@@ -80,7 +89,7 @@ Here is an example invocation for reading from kafka topic in a single-run mode
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
```java
-[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.11.0 \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
diff --git a/website/docs/docker_demo.md b/website/docs/docker_demo.md
index eeccf117aa..26d41251bc 100644
--- a/website/docs/docker_demo.md
+++ b/website/docs/docker_demo.md
@@ -391,8 +391,7 @@ $SPARK_INSTALL/bin/spark-shell \
--deploy-mode client \
--driver-memory 1G \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
...
Welcome to
@@ -793,8 +792,7 @@ $SPARK_INSTALL/bin/spark-shell \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
# Copy On Write Table:
@@ -1050,8 +1048,7 @@ $SPARK_INSTALL/bin/spark-shell \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
Welcome to
____ __
@@ -1247,8 +1244,7 @@ $SPARK_INSTALL/bin/spark-shell \
--driver-memory 1G \
--master local[2] \
--executor-memory 3G \
- --num-executors 1 \
- --packages org.apache.spark:spark-avro_2.11:2.4.4
+ --num-executors 1
# Read Optimized Query
scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index ae87c579cd..6f2c80d5cf 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -5,7 +5,7 @@ keywords: [hudi, deltastreamer, hoodiedeltastreamer]
## DeltaStreamer
-The `HoodieDeltaStreamer` utility (part of hudi-utilities-bundle) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
+The `HoodieDeltaStreamer` utility (part of `hudi-utilities-bundle`) provides the way to ingest from different sources such as DFS or Kafka, with the following capabilities.
- Exactly once ingestion of new events from Kafka, [incremental imports](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports) from Sqoop or output of `HiveIncrementalPuller` or files under a DFS folder
- Support json, avro or a custom record types for the incoming data
@@ -151,6 +151,12 @@ and then ingest it as follows.
In some cases, you may want to migrate your existing table into Hudi beforehand. Please refer to [migration guide](/docs/migration_guide).
+Starting from the 0.11.0 release, we provide a new `hudi-utilities-slim-bundle`, which excludes dependencies that
+can cause conflicts and compatibility issues with different versions of Spark. If using `hudi-utilities-bundle`
+alone to run `HoodieDeltaStreamer` in Spark encounters compatibility issues, use `hudi-utilities-slim-bundle`
+together with the Hudi Spark bundle corresponding to the Spark version used, e.g.,
+`--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.11.0,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0`.
+
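A minimal sketch of the same pairing using locally built jars instead of `--packages` (the jar paths assume a local Hudi build and are illustrative, not part of this commit):

```shell
# Sketch: pass the slim utilities bundle as the application jar and the
# matching Spark bundle via --jars; remaining DeltaStreamer args elided.
spark-submit \
  --jars packaging/hudi-spark-bundle/target/hudi-spark3.1-bundle_2.12-0.11.0.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-0.11.0.jar \
  ...
```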
### MultiTableDeltaStreamer
`HoodieMultiTableDeltaStreamer`, a wrapper on top of `HoodieDeltaStreamer`, enables one to ingest multiple tables at a single go into hudi datasets. Currently it only supports sequential processing of tables to be ingested and COPY_ON_WRITE storage type. The command line options for `HoodieMultiTableDeltaStreamer` are pretty much similar to `HoodieDeltaStreamer` with the only exception that you are required to provide table wise configs in separate files in a dedicated config folder. The [...]
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 0841351f63..8f77a34dd7 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -20,7 +20,7 @@ Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [
| Hudi | Supported Spark 3 version |
|:----------------|:------------------------------|
-| 0.11.0 | 3.2.x (default build), 3.1.x |
+| 0.11.0 | 3.2.x (default build, Spark bundle only), 3.1.x |
| 0.10.0 | 3.1.x (default build), 3.0.x |
| 0.7.0 - 0.9.0 | 3.0.x |
| 0.6.0 and prior | not supported |
@@ -29,6 +29,16 @@ Hudi works with Spark-2.4.3+ & Spark 3.x versions. You can follow instructions [
As of 0.9.0 release, Spark SQL DML support has been added and is experimental.
+In the 0.11.0 release, we add support for Spark 3.2.x and continue to support Spark 3.1.x and Spark 2.4.x. Spark
+3.0.x is no longer officially supported. To make it easier for users to pick the right Hudi Spark bundle for their
+deployment, we make the following adjustments to the bundle naming:
+
+- For each supported Spark minor version, there is a corresponding Hudi Spark bundle with the major and minor version
+in the naming, i.e., `hudi-spark3.2-bundle`, `hudi-spark3.1-bundle`, and `hudi-spark2.4-bundle`.
+- We encourage users to migrate to the new bundles above. We keep the bundles with the legacy naming in this release,
+i.e., `hudi-spark3-bundle`, targeting Spark 3.2.x, the latest Spark 3 version, and `hudi-spark-bundle` for
+Spark 2.4.x.
+
<Tabs
defaultValue="scala"
values={[
@@ -41,24 +51,25 @@ values={[
From the extracted directory run spark-shell with Hudi as:
```scala
-// spark-shell for spark 3.1
+// spark-shell for spark 3.2
spark-shell \
- --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 \
- --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+ --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
+ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+ --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
-// spark-shell for spark 3.2
+// spark-shell for spark 3.1
spark-shell \
- --packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3 \
+ --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-// spark-shell for spark 2 with scala 2.12
+// spark-shell for spark 2.4 with scala 2.12
spark-shell \
- --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4 \
+ --packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-// spark-shell for spark 2 with scala 2.11
+// spark-shell for spark 2.4 with scala 2.11
spark-shell \
- --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4 \
+ --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
@@ -69,24 +80,25 @@ Hudi support using Spark SQL to write and read data with the **HoodieSparkSessio
From the extracted directory run Spark SQL with Hudi as:
```shell
-# Spark SQL for spark 3.1
-spark-sql --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 \
+# Spark SQL for spark 3.2
+spark-sql --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
-# Spark SQL for spark 3.0
-spark-sql --packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3 \
+# Spark SQL for spark 3.1
+spark-sql --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
-# Spark SQL for spark 2 with scala 2.11
-spark-sql --packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4 \
+# Spark SQL for spark 2.4 with scala 2.11
+spark-sql --packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
-# Spark SQL for spark 2 with scala 2.12
+# Spark SQL for spark 2.4 with scala 2.12
spark-sql \
- --packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4 \
+ --packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```
@@ -100,24 +112,25 @@ From the extracted directory run pyspark with Hudi as:
# pyspark
export PYSPARK_PYTHON=$(which python3)
-# for spark3.1
+# for spark3.2
pyspark
---packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2
+--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
-# for spark3.0
+# for spark3.1
pyspark
---packages org.apache.hudi:hudi-spark3.0.3-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.0.3
+--packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-# for spark2 with scala 2.12
+# for spark2.4 with scala 2.12
pyspark
---packages org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:2.4.4
+--packages org.apache.hudi:hudi-spark2.4-bundle_2.12:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
-# for spark2 with scala 2.11
+# for spark2.4 with scala 2.11
pyspark
---packages org.apache.hudi:hudi-spark-bundle_2.11:0.10.1,org.apache.spark:spark-avro_2.11:2.4.4
+--packages org.apache.hudi:hudi-spark2.4-bundle_2.11:0.11.0
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
@@ -126,10 +139,9 @@ pyspark
:::note Please note the following
<ul>
- <li>spark-avro module needs to be specified in --packages as it is not included with spark-shell by default</li>
- <li>spark-avro and spark versions must match (we have used 3.1.2 for both above)</li>
- <li>we have used hudi-spark-bundle built for scala 2.12 since the spark-avro module used also depends on 2.12. If spark-avro_2.11 is used, correspondingly hudi-spark-bundle_2.11 needs to be used. </li>
+ <li> For Spark 3.2, the additional spark_catalog config is required: --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' </li>
+ <li> We have used hudi-spark-bundle built for scala 2.12 since the spark-avro module used can also depend on 2.12. </li>
</ul>
:::
@@ -1175,8 +1187,8 @@ more details please refer to [procedures](procedures).
## Where to go from here?
You can also do the quickstart by [building hudi yourself](https://github.com/apache/hudi#building-apache-hudi-from-source),
-and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
-instead of `--packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
+and using `--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.1?-*.*.*-SNAPSHOT.jar` in the spark-shell command above
+instead of `--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0`. Hudi also supports scala 2.12. Refer [build with scala 2.12](https://github.com/apache/hudi#build-with-different-spark-versions)
for more info.
Also, we used Spark here to show case the capabilities of Hudi. However, Hudi can support multiple table types/query types and
diff --git a/website/docs/syncing_metastore.md b/website/docs/syncing_metastore.md
index f1c1fdc582..1b2baa0f24 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -181,7 +181,6 @@ Assuming the metastore is configured properly, then start the spark-shell.
```
$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE \
- --packages org.apache.spark:spark-avro_2.11:2.4.4 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```