[hudi] branch asf-site updated: [HUDI-5826] Add Hudi CLI docs for GCP Dataproc node (#8008)

yihua Sun, 24 Sep 2023 12:24:15 -0700

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new f18bd02967f [HUDI-5826] Add Hudi CLI docs for GCP Dataproc node (#8008)
f18bd02967f is described below

commit f18bd02967f69e021bc11a49cb232291ba88a917
Author: Pramod Biligiri <[email protected]>
AuthorDate: Mon Sep 25 00:54:00 2023 +0530

    [HUDI-5826] Add Hudi CLI docs for GCP Dataproc node (#8008)
---
 website/docs/cli.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/website/docs/cli.md b/website/docs/cli.md
index 4242609a68c..77bd06a9725 100644
--- a/website/docs/cli.md
+++ b/website/docs/cli.md
@@ -76,7 +76,43 @@ Once these are set, you are good to launch hudi-cli and access S3 dataset.
 ```
 ./hudi-cli/hudi-cli.sh
 ```
+### Using hudi-cli on Google Dataproc
+[Dataproc](https://cloud.google.com/dataproc) is Google's managed service for running Apache Hadoop, Apache Spark, 
+Apache Flink, Presto and many other frameworks, including Hudi. If you want to run the Hudi CLI on a Dataproc node 
+which has not been launched with Hudi support enabled, you can use the steps below:  
 
+These steps use Hudi version 0.13.0. If you want to use a different version you will have to edit the below commands 
+appropriately:  
+1. Once you've started the Dataproc cluster, you can ssh into it as follows:
+```
+$ gcloud compute ssh --zone "YOUR_ZONE" "HOSTNAME_OF_MASTER_NODE" --project "YOUR_PROJECT"
+```  
+
+2. Download the Hudi CLI bundle
+```
+wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/0.13.0/hudi-cli-bundle_2.12-0.13.0.jar
+```
+
+3. Download the Hudi Spark bundle
+```
+wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.12/0.13.0/hudi-spark-bundle_2.12-0.13.0.jar
+```     
+
+4. Download the shell script that launches Hudi CLI bundle
+```
+wget https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
+```    
+
+5. Launch Hudi CLI bundle with appropriate environment variables as follows:
+``` 
+CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar CLI_BUNDLE_JAR=hudi-cli-bundle_2.12-0.13.0.jar SPARK_BUNDLE_JAR=hudi-spark-bundle_2.12-0.13.0.jar ./hudi-cli-with-bundle.sh
+```
+
+6. hudi->connect --path gs://path_to_some_table  
+Metadata for table some_table loaded  
+
+7. hudi:some_table->commits show --limit 5  
+This command should show the recent commits, if the above steps work correctly.  
 
 ## Connect to a Kerberized cluster

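The added docs note that the commands must be edited by hand for a Hudi version other than 0.13.0. As a rough illustration only (not part of the commit), the sketch below derives the three download URLs and the launch command of steps 2 to 5 from two variables. `HUDI_VERSION`, `SCALA_VERSION`, `MAVEN_BASE` and the `*_URL` names are hypothetical helpers introduced here; the URL patterns are copied from the diff above, and it is an assumption that a `release-<version>` branch and matching Maven artifacts exist for versions other than 0.13.0. The script only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch: print the Dataproc setup commands for a chosen Hudi version.
# HUDI_VERSION and SCALA_VERSION are hypothetical variables, not defined by Hudi.
HUDI_VERSION="${HUDI_VERSION:-0.13.0}"
SCALA_VERSION="${SCALA_VERSION:-2.12}"

# URL patterns taken from the commands in the diff above.
MAVEN_BASE="https://repo1.maven.org/maven2/org/apache/hudi"
CLI_BUNDLE_JAR="hudi-cli-bundle_${SCALA_VERSION}-${HUDI_VERSION}.jar"
SPARK_BUNDLE_JAR="hudi-spark-bundle_${SCALA_VERSION}-${HUDI_VERSION}.jar"
CLI_BUNDLE_URL="${MAVEN_BASE}/hudi-cli-bundle_${SCALA_VERSION}/${HUDI_VERSION}/${CLI_BUNDLE_JAR}"
SPARK_BUNDLE_URL="${MAVEN_BASE}/hudi-spark-bundle_${SCALA_VERSION}/${HUDI_VERSION}/${SPARK_BUNDLE_JAR}"
# Assumption: other releases follow the same release-<version> branch naming.
LAUNCHER_URL="https://raw.githubusercontent.com/apache/hudi/release-${HUDI_VERSION}/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh"

# Dry run: echo each command so it can be reviewed before being run on the node.
echo "wget ${CLI_BUNDLE_URL}"
echo "wget ${SPARK_BUNDLE_URL}"
echo "wget ${LAUNCHER_URL}"
# A file fetched with wget is not executable by default.
echo "chmod +x hudi-cli-with-bundle.sh"
# \$DATAPROC_DIR is left unexpanded so it resolves on the Dataproc node itself.
echo "CLIENT_JAR=\$DATAPROC_DIR/lib/gcs-connector.jar CLI_BUNDLE_JAR=${CLI_BUNDLE_JAR} SPARK_BUNDLE_JAR=${SPARK_BUNDLE_JAR} ./hudi-cli-with-bundle.sh"
```

With the defaults, the printed `wget` lines match the 0.13.0 commands in the diff. Setting `HUDI_VERSION` in the environment before running the script switches every URL at once, which may be less error-prone than editing each command separately.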