This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 18f71813c69 [HUDI-7133] Improve dbt example for better guidance (#10155)
18f71813c69 is described below
commit 18f71813c69ecc10682b4e6e5bc4f5e708de0155
Author: Shiyan Xu <[email protected]>
AuthorDate: Wed Nov 22 02:00:27 2023 -0600
[HUDI-7133] Improve dbt example for better guidance (#10155)
---
hudi-examples/hudi-examples-dbt/README.md | 132 +++++++++++++++++++++++-------
1 file changed, 101 insertions(+), 31 deletions(-)
diff --git a/hudi-examples/hudi-examples-dbt/README.md b/hudi-examples/hudi-examples-dbt/README.md
index 8fe796d37c5..22f74591126 100644
--- a/hudi-examples/hudi-examples-dbt/README.md
+++ b/hudi-examples/hudi-examples-dbt/README.md
@@ -18,45 +18,113 @@
This dbt project transforms demonstrates hudi integration with dbt, it has a
few models to demonstrate the different ways in which you can create hudi
datasets using dbt.
-### What is this repo?
-What this repo _is_:
-- A self-contained playground dbt project, useful for testing out scripts, and communicating some of the core dbt concepts.
+This directory serves as a self-contained playground dbt project, useful for testing out scripts, and communicating some of the core dbt concepts.
-### Running this project
-To get up and running with this project:
-1. Install dbt using [these instructions](https://docs.getdbt.com/docs/installation).
+### Setup
-2. Install [dbt-spark](https://github.com/dbt-labs/dbt-spark) package:
-```bash
-pip install dbt-spark
-```
+Switch working directory and have `python3` installed.
-3. Clone this repo and change into the `hudi-examples-dbt` directory from the command line:
-```bash
+```shell
cd hudi-examples/hudi-examples-dbt
```
-4. Set up a profile called `spark` to connect to a spark cluster by following [these instructions](https://docs.getdbt.com/reference/warehouse-profiles/spark-profile). If you have access to a data warehouse, you can use those credentials – we recommend setting your [target schema](https://docs.getdbt.com/docs/configure-your-profile#section-populating-your-profile) to be a new schema (dbt will create the schema for you, as long as you have the right privileges). If you don't have access t [...]
+### Install dbt
+
+Create python virtual environment ([Reference](https://docs.getdbt.com/docs/installation)).
+
+```shell
+python3 -m venv dbt-env
+source dbt-env/bin/activate
+```
+
+We are using `thrift` as the connection method ([Reference](https://docs.getdbt.com/docs/core/connect-data-platform/spark-setup)).
+
+```shell
+python3 -m pip install "dbt-spark[PyHive]"
+```
+
+### Configure dbt for Spark
+
+Set up a profile called `spark` to connect to a spark cluster via thrift server ([Reference](https://docs.getdbt.com/docs/core/connect-data-platform/spark-setup#thrift)).
+
+```yaml
+spark:
+  target: dev
+  outputs:
+    dev:
+      type: spark
+      method: thrift
+      schema: my_schema
+      host: localhost
+      port: 10000
+      server_side_parameters:
+        "spark.driver.memory": "3g"
+```
+
+_If you have access to a data warehouse, you can use those credentials – we recommend setting your [target schema](https://docs.getdbt.com/docs/configure-your-profile#section-populating-your-profile) to be a new schema (dbt will create the schema for you, as long as you have the right privileges). If you don't have access to an existing data warehouse, you can also setup a local postgres database and connect to it in your profile._
+
+### Start Spark Thrift server
+
+> **NOTE** Using these versions
+> - Spark 3.2.3 (with Derby 10.14.2.0)
+> - Hudi 0.14.0
+
+Start a local Derby server
+
+```shell
+export DERBY_VERSION=10.14.2.0
+wget https://archive.apache.org/dist/db/derby/db-derby-$DERBY_VERSION/db-derby-$DERBY_VERSION-bin.tar.gz -P /opt/
+tar -xf /opt/db-derby-$DERBY_VERSION-bin.tar.gz -C /opt/
+export DERBY_HOME=/opt/db-derby-$DERBY_VERSION-bin
+$DERBY_HOME/bin/startNetworkServer -h 0.0.0.0
+```
+
+Start a local Thrift server for Spark
+
+```shell
+export SPARK_VERSION=3.2.3
+wget https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz -P /opt/
+tar -xf /opt/spark-$SPARK_VERSION-bin-hadoop2.7.tgz -C /opt/
+export SPARK_HOME=/opt/spark-$SPARK_VERSION-bin-hadoop2.7
+
+# install dependencies
+cp $DERBY_HOME/lib/{derby,derbyclient}.jar $SPARK_HOME/jars/
+wget https://repository.apache.org/content/repositories/releases/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.14.0/hudi-spark3.2-bundle_2.12-0.14.0.jar -P $SPARK_HOME/jars/
+
+# start Thrift server connecting to Derby as HMS backend
+$SPARK_HOME/sbin/start-thriftserver.sh \
+--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
+--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
+--conf spark.sql.warehouse.dir=/tmp/hudi/hive/warehouse \
+--hiveconf hive.metastore.warehouse.dir=/tmp/hudi/hive/warehouse \
+--hiveconf hive.metastore.schema.verification=false \
+--hiveconf datanucleus.schema.autoCreateAll=true \
+--hiveconf javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver \
+--hiveconf 'javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/default;create=true'
+```
-> **NOTE:** You need to include the hudi spark bundle to the spark cluster, the latest supported version is 0.10.1.
+### Verify dbt setup
-5. Ensure your profile is setup correctly from the command line:
-```bash
+```shell
dbt debug
```
Output of the above command should show this text at the end of the output:
-```bash
+
+```
All checks passed!
```
-6. Run the models:
-```bash
+### Run the models
+
+```shell
dbt run
```
-Output should look like this:
-```bash
+Output should look like this
+
+```
05:47:28 Running with dbt=1.0.0
05:47:28 Found 5 models, 10 tests, 0 snapshots, 0 analyses, 0 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
05:47:28
@@ -77,12 +145,16 @@ Output should look like this:
05:47:42
05:47:42 Completed successfully
```
-7. Test the output of the models:
-```bash
+
+### Test the output of the models
+
+```shell
dbt test
```
-Output should look like this:
-```bash
+
+Output should look like this
+
+```
05:48:17 Running with dbt=1.0.0
05:48:17 Found 5 models, 10 tests, 0 snapshots, 0 analyses, 0 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
05:48:17
@@ -116,14 +188,12 @@ Output should look like this:
05:48:26 Done. PASS=10 WARN=0 ERROR=0 SKIP=0 TOTAL=10
```
-8. Generate documentation for the project:
-```bash
-dbt docs generate
-```
+### Generate documentation
-9. View the [documentation](http://127.0.0.1:8080/#!/overview) for the project after running the following command:
-```bash
+```shell
+dbt docs generate
dbt docs serve
+# then visit http://127.0.0.1:8080/#!/overview
```
---
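Note (not part of the commit): the updated README shows the `spark` profile but does not say where to save it. The sketch below is one way to do that, assuming dbt's usual lookup of `profiles.yml` in `$DBT_PROFILES_DIR`, falling back to `~/.dbt`. The function name `write_spark_profile` is hypothetical, and the profile values are copied from the README hunk above.

```shell
# Sketch only: save the README's `spark` profile where dbt is assumed to look for it.
# Assumption: dbt reads profiles.yml from $DBT_PROFILES_DIR, else from ~/.dbt.
# write_spark_profile is a hypothetical helper, not part of dbt or Hudi.
write_spark_profile() {
  profiles_dir="${DBT_PROFILES_DIR:-$HOME/.dbt}"
  profiles_file="$profiles_dir/profiles.yml"

  # Never clobber an existing profiles.yml; merge by hand in that case.
  if [ -e "$profiles_file" ]; then
    echo "refusing to overwrite $profiles_file" >&2
    return 1
  fi

  mkdir -p "$profiles_dir"
  cat > "$profiles_file" <<'EOF'
spark:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: my_schema
      host: localhost
      port: 10000
      server_side_parameters:
        "spark.driver.memory": "3g"
EOF

  # Print the path so the caller can inspect or edit the file.
  echo "$profiles_file"
}
```

Calling `write_spark_profile` once and then running `dbt debug` from `hudi-examples/hudi-examples-dbt` should exercise the connection, provided the Thrift server from the README is already listening on port 10000.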