This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 37c9b7b  [MINOR] Add faq for overriding Hudi jar in EMR cluster (#3718)
37c9b7b is described below

commit 37c9b7bb71e7f06fda1cb3a5a4ffe527c2760a0b
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Sun Sep 26 11:08:32 2021 -0700

    [MINOR] Add faq for overriding Hudi jar in EMR cluster (#3718)
---
 website/learn/faq.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/website/learn/faq.md b/website/learn/faq.md
index 51ba060..da5b924 100644
--- a/website/learn/faq.md
+++ b/website/learn/faq.md
@@ -376,6 +376,58 @@ AWS Glue jobs can write, read and update Glue Data Catalog for hudi tables. In o
 
 In case if your using either notebooks or Zeppelin through Glue dev-endpoints, your script might not be able to integrate with Glue DataCatalog when writing to hudi tables.
 
+### How to override Hudi jars in EMR?
+
+If you are looking to override Hudi jars in your EMR clusters, one way to achieve this is by providing the Hudi jars through a bootstrap script.
+Here are the example steps for overriding Hudi version 0.7.0 in EMR 6.2.0.
+
+**Build Hudi Jars:**
+```shell script
+# Git clone
+git clone https://github.com/apache/hudi.git && cd hudi   
+
+# Get version 0.7.0
+git checkout --track origin/release-0.7.0
+
+# Build jars with spark 3.0.0 and scala 2.12 (since emr 6.2.0 uses spark 3 which requires scala 2.12):
+mvn clean package -DskipTests -Dspark3  -Dscala-2.12 -T 30 
+```
+
+**Copy jars to s3:**
+These are the jars we are interested in after build completes. Copy them to a temp location first.
+
+```shell script
+mkdir -p ~/Downloads/hudi-jars
+cp packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.12-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-timeline-server-bundle/target/hudi-timeline-server-bundle-0.7.0.jar ~/Downloads/hudi-jars/
+cp packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.7.0.jar ~/Downloads/hudi-jars/
+```
+
+Upload all jars from ~/Downloads/hudi-jars/ to the s3 location s3://xxx/yyy/hudi-jars
+
+**Include Hudi jars as part of the emr bootstrap script:**
+The script below downloads the Hudi jars from the s3 location above. Use this script as part of `bootstrap-actions` when launching the EMR cluster to install the jars on each node.
+
+```shell script
+#!/bin/bash
+sudo mkdir -p /mnt1/hudi-jars
+
+sudo aws s3 cp s3://xxx/yyy/hudi-jars /mnt1/hudi-jars --recursive
+
+# create symlinks
+cd /mnt1/hudi-jars
+sudo ln -sf hudi-hadoop-mr-bundle-0.7.0.jar hudi-hadoop-mr-bundle.jar
+sudo ln -sf hudi-hive-sync-bundle-0.7.0.jar hudi-hive-sync-bundle.jar
+sudo ln -sf hudi-spark-bundle_2.12-0.7.0.jar hudi-spark-bundle.jar
+sudo ln -sf hudi-timeline-server-bundle-0.7.0.jar hudi-timeline-server-bundle.jar
+sudo ln -sf hudi-utilities-bundle_2.12-0.7.0.jar hudi-utilities-bundle.jar
+```
+
+**Using the overridden jar in DeltaStreamer:**
+When invoking DeltaStreamer, specify the above jar location as part of the spark-submit command.
+
 ### Why partition fields are also stored in parquet files in addition to the partition path ?
 
 Hudi supports customizable partition values which could be a derived value of another field. Also, storing the partition value only as part of the field results in losing type information when queried by various query engines.

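The last step of the new FAQ entry (passing the overridden jar to DeltaStreamer) is described without a command. A minimal sketch of what such a spark-submit invocation might look like follows. It is not part of the commit. It assumes the `/mnt1/hudi-jars` symlinks created by the bootstrap script; the table name, S3 paths, ordering field, source class choice, and properties file are hypothetical placeholders, and the properties file would still need to supply the source-specific settings. The script only prints the command unless `RUN=1` is set.

```shell
#!/bin/bash
# Sketch only, not taken from the commit. All values marked "hypothetical" are placeholders.
HUDI_JAR_DIR=/mnt1/hudi-jars                              # directory populated by the bootstrap script
UTILITIES_JAR="$HUDI_JAR_DIR/hudi-utilities-bundle.jar"   # symlink created by the bootstrap script

TARGET_TABLE=my_table                                     # hypothetical table name
TARGET_BASE_PATH=s3://xxx/yyy/my_table                    # hypothetical table location
ORDERING_FIELD=ts                                         # hypothetical ordering column
PROPS_FILE=s3://xxx/yyy/deltastreamer.properties          # hypothetical properties file

# Build the command as an array so each argument stays a single word.
cmd=(spark-submit
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
  "$UTILITIES_JAR"
  --table-type COPY_ON_WRITE
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
  --source-ordering-field "$ORDERING_FIELD"
  --target-base-path "$TARGET_BASE_PATH"
  --target-table "$TARGET_TABLE"
  --props "$PROPS_FILE")

# Dry run by default: print the command. Set RUN=1 to execute it on the cluster.
if [ "${RUN:-0}" = "1" ]; then
  "${cmd[@]}"
else
  printf '%s ' "${cmd[@]}"
  echo
fi
```

Pointing the application jar at the unversioned symlink means the command does not need to change when the jars behind the symlinks are swapped for another build.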