[hudi] branch asf-site updated: [MINOR] Add faq for overriding Hudi jar in EMR cluster (#3718)

xushiyan Sun, 26 Sep 2021 11:08:49 -0700

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 37c9b7b  [MINOR] Add faq for overriding Hudi jar in EMR cluster (#3718)
37c9b7b is described below

commit 37c9b7bb71e7f06fda1cb3a5a4ffe527c2760a0b
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Sun Sep 26 11:08:32 2021 -0700

    [MINOR] Add faq for overriding Hudi jar in EMR cluster (#3718)
---
 website/learn/faq.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/website/learn/faq.md b/website/learn/faq.md
index 51ba060..da5b924 100644
--- a/website/learn/faq.md
+++ b/website/learn/faq.md
@@ -376,6 +376,58 @@ AWS Glue jobs can write, read and update Glue Data Catalog for hudi tables. In o
 
 In case if your using either notebooks or Zeppelin through Glue dev-endpoints, your script might not be able to integrate with Glue DataCatalog when writing to hudi tables.
 
+### How to override Hudi jars in EMR?
+
+If you are looking to override Hudi jars in your EMR clusters, one way to achieve this is by providing the Hudi jars through a bootstrap script.
+Here are example steps for overriding the bundled Hudi jars with version 0.7.0 on EMR 6.2.0.
+
+**Build Hudi Jars:**
+```shell script
+# Git clone
+git clone https://github.com/apache/hudi.git && cd hudi   
+
+# Get version 0.7.0
+git checkout --track origin/release-0.7.0
+
+# Build jars with spark 3.0.0 and scala 2.12 (since emr 6.2.0 uses spark 3 which requires scala 2.12):
+mvn clean package -DskipTests -Dspark3 -Dscala-2.12 -T 30
+```
+
+**Copy jars to s3:**
+These are the jars we are interested in after the build completes. Copy them to a temp location first.
+
+```shell script
+mkdir -p ~/Downloads/hudi-jars
+cp packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.12-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-timeline-server-bundle/target/hudi-timeline-server-bundle-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.7.0.jar ~/Downloads/hudi-jars/
+```
+
+Upload all jars from ~/Downloads/hudi-jars/ to the S3 location s3://xxx/yyy/hudi-jars
+
+**Include Hudi jars as part of the emr bootstrap script:**
+The script below downloads the Hudi jars from the above S3 location. Use this script as part of `bootstrap-actions` when launching the EMR cluster to install the jars on each node.
+
+```shell script
+#!/bin/bash
+sudo mkdir -p /mnt1/hudi-jars
+
+sudo aws s3 cp s3://xxx/yyy/hudi-jars /mnt1/hudi-jars --recursive
+
+# create symlinks
+cd /mnt1/hudi-jars
+sudo ln -sf hudi-hadoop-mr-bundle-0.7.0.jar hudi-hadoop-mr-bundle.jar
+sudo ln -sf hudi-hive-sync-bundle-0.7.0.jar hudi-hive-sync-bundle.jar
+sudo ln -sf hudi-spark-bundle_2.12-0.7.0.jar hudi-spark-bundle.jar
+sudo ln -sf hudi-timeline-server-bundle-0.7.0.jar hudi-timeline-server-bundle.jar
+sudo ln -sf hudi-utilities-bundle_2.12-0.7.0.jar hudi-utilities-bundle.jar
+```
+
+**Using the overridden jar in DeltaStreamer:**
+When invoking DeltaStreamer, specify the above jar location as part of the spark-submit command.
+
 ### Why partition fields are also stored in parquet files in addition to the partition path ?
 
 Hudi supports customizable partition values which could be a derived value of another field. Also, storing the partition value only as part of the field results in losing type information when queried by various query engines.
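
For the DeltaStreamer step in the FAQ entry added by this diff, a minimal sketch of the spark-submit invocation might look like the following. It assumes the bootstrap script has already placed the symlinked utilities bundle under /mnt1/hudi-jars. The target path s3://xxx/yyy/hudi-table and the table name my_table are hypothetical placeholders, and a real run needs further DeltaStreamer arguments (source class, properties file, and so on) that depend on your pipeline. The function only prints the command, so nothing is submitted.

```shell
#!/bin/bash
# Directory populated by the bootstrap script (assumed; override via the environment).
HUDI_JARS_DIR="${HUDI_JARS_DIR:-/mnt1/hudi-jars}"

# Print a spark-submit command that runs DeltaStreamer from the overridden
# utilities bundle. Extra DeltaStreamer arguments are passed through "$@".
build_deltastreamer_cmd() {
  echo spark-submit \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    "${HUDI_JARS_DIR}/hudi-utilities-bundle.jar" \
    "$@"
}

# Dry run with placeholder (hypothetical) table settings.
build_deltastreamer_cmd \
  --table-type COPY_ON_WRITE \
  --target-base-path s3://xxx/yyy/hudi-table \
  --target-table my_table
```

Printing first makes it easy to confirm that the application jar points at the overridden bundle rather than the one shipped with EMR before running the command on the cluster.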
