This is an automated email from the ASF dual-hosted git repository.
sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 236248d [HUDI-2083] Adding instructions to access S3 in hudi-cli (#4603)
236248d is described below
commit 236248d5f8474e23f7288ce5005910ab76488dd3
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Fri Jan 14 12:46:28 2022 -0500
[HUDI-2083] Adding instructions to access S3 in hudi-cli (#4603)
Co-authored-by: Y Ethan Guo <[email protected]>
---
website/docs/cli.md | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/website/docs/cli.md b/website/docs/cli.md
index 64f0c1e..2b9bb60 100644
--- a/website/docs/cli.md
+++ b/website/docs/cli.md
@@ -4,9 +4,45 @@ keywords: [hudi, cli]
last_modified_at: 2021-08-18T15:59:57-04:00
---
+### Local set up
Once hudi has been built, the shell can be fired up via `cd hudi-cli && ./hudi-cli.sh`.
A hudi table resides on DFS, in a location referred to as the `basePath`, and we need
this location in order to connect to a Hudi table. The Hudi library manages this table
internally, using the `.hoodie` subfolder to track all metadata.
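+
+For example, once a table exists at some `basePath`, you can connect to it from the shell (the path below is just an illustration):
+```
+connect --path /tmp/hudi_trips_cow
+```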
+
+### Using hudi-cli in S3
+If you are using Hudi packaged with AWS EMR, you can find instructions for using hudi-cli [here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-cli.html).
+If you are not using EMR, or would like to use the latest hudi-cli built from master, you can follow the steps below to access an S3 dataset from your local environment (laptop).
+
+Build Hudi with the corresponding Spark version, e.g., `-Dspark3.1.x`.
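+
+For example, a full build command might look like the following; the exact Maven flags are illustrative, so check the build profiles supported by your Hudi version:
+```
+# from the root of the Hudi repo: skip tests and select the Spark 3.1 profile
+mvn clean package -DskipTests -Dspark3.1.x
+```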
+
+Set the following environment variables.
+```
+export AWS_REGION=us-east-2
+export AWS_ACCESS_KEY_ID=<key_id>
+export AWS_SECRET_ACCESS_KEY=<secret_key>
+export SPARK_HOME=<spark_home>
+```
+Ensure that `SPARK_HOME` points to a local Spark installation compatible with the Spark version Hudi was compiled against above.
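+
+For instance, you can print the version of the Spark installation that `SPARK_HOME` points to and confirm it matches:
+```
+$SPARK_HOME/bin/spark-submit --version
+```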
+
+Apart from these, we might need to add AWS jars to the class path so that S3 can be accessed from the local environment.
+We need two jars, namely the aws-java-sdk-bundle jar and the hadoop-aws jar, which you can find online.
+For example:
+```
+wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar -O /lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
+wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.48/aws-java-sdk-bundle-1.12.48.jar -O /lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar
+```
+
+#### Note: The AWS jar versions below are specific to Spark 3.2.0
+```
+export CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
+```
+Once these are set, you are good to launch hudi-cli and access the S3 dataset.
+```
+./hudi-cli/hudi-cli.sh
+```
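+
+From the shell, you can then connect to a table on S3; the bucket and table path below are placeholders for your own:
+```
+connect --path s3a://my-bucket/path/to/hudi-table
+```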
+
+## Using hudi-cli
+
To initialize a hudi table, use the following command.
```java