This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new c53ebc29279 [HUDI-9116] Update Hudi CLI public documentation (#12923)
c53ebc29279 is described below
commit c53ebc292793a4873ff88c88a5435a4a6b9dfc53
Author: Mansi Patel <[email protected]>
AuthorDate: Mon Mar 10 14:42:03 2025 -0700
[HUDI-9116] Update Hudi CLI public documentation (#12923)
* [DOCS] Update Hudi CLI doc
* Minor change
* Revise Hudi cli public doc
* Update hudi 1.0.0 and 1.0.1 public docs for hudi cli
* Update cli.md
* Update cli.md
* Update cli.md
---------
Co-authored-by: Y Ethan Guo <[email protected]>
---
website/docs/cli.md | 39 ++++++++++++++++-------------
website/versioned_docs/version-1.0.0/cli.md | 22 ++++++++--------
website/versioned_docs/version-1.0.1/cli.md | 22 ++++++++--------
3 files changed, 44 insertions(+), 39 deletions(-)
diff --git a/website/docs/cli.md b/website/docs/cli.md
index fbae38e18b7..55cc369d21c 100644
--- a/website/docs/cli.md
+++ b/website/docs/cli.md
@@ -5,13 +5,15 @@ last_modified_at: 2021-08-18T15:59:57-04:00
---
### Local set up
-Once hudi has been built, the shell can be fired by via `cd hudi-cli && ./hudi-cli.sh`.
+Once hudi has been built, the shell can be fired up via `cd packaging/hudi-cli-bundle && ./hudi-cli-with-bundle.sh` or `./packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh`.
-### Hudi CLI Bundle setup
-In release `0.13.0` we have now added another way of launching the `hudi cli`, which is using the `hudi-cli-bundle`.
+### Hudi CLI setup
+In release `0.13.0` we added a new way of launching the `hudi cli`, using the `hudi-cli-bundle` script.
-There are a couple of requirements when using this approach such as having
`spark` installed locally on your machine.
-It is required to use a spark distribution with hadoop dependencies packaged
such as `spark-3.3.1-bin-hadoop2.tgz` from
https://archive.apache.org/dist/spark/.
+#### Note: The traditional `hudi-cli.sh` script has been deprecated and replaced with `hudi-cli-with-bundle.sh` from the `1.1.0` release onwards. Users should migrate to the new bundled CLI script for better compatibility and ease of use.
+
+There are a couple of requirements for this approach, such as having `spark` installed locally on your machine.
+It is required to use a spark distribution with hadoop dependencies packaged
such as `spark-3.5.4-bin-hadoop3.tgz` from
https://archive.apache.org/dist/spark/.
We also recommend you set an env variable `$SPARK_HOME` to the path of where
spark is installed on your machine.
One important thing to note is that the `hudi-spark-bundle` should also be
present when using the `hudi-cli-bundle`.
To provide the locations of these bundle jars you can set them in your shell
like so:
@@ -48,39 +50,42 @@ we would need this location in order to connect to a Hudi
table. Hudi library ef
If you are using hudi that comes packaged with AWS EMR, you can find
instructions to use hudi-cli
[here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-cli.html).
If you are not using EMR, or would like to use latest hudi-cli from master,
you can follow the below steps to access S3 dataset in your local environment
(laptop).
-Build Hudi with corresponding Spark version, for eg, -Dspark3.1.x
+Build Hudi with the corresponding Spark version, e.g., `-Dspark3.5`.
Set the following environment variables.
```
export AWS_REGION=us-east-2
export AWS_ACCESS_KEY_ID=<key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
+
export SPARK_HOME=<spark_home>
+export CLI_BUNDLE_JAR=<cli-bundle-jar-to-use>
+export SPARK_BUNDLE_JAR=<spark-bundle-jar-to-use>
```
-Ensure you set the SPARK_HOME to your local spark home compatible to compiled
hudi spark version above.
+Ensure you set SPARK_HOME to a local Spark installation that is compatible with the Spark version Hudi was compiled against. One important thing to note is that the `hudi-spark-bundle` should also be present when using the `hudi-cli-bundle`.
Apart from these, we might need to add aws jars to class path so that
accessing S3 is feasible from local.
We need two jars, namely, aws-java-sdk-bundle jar and hadoop-aws jar which you
can find online.
For eg:
```
-wget
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
-o /lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar
-wget
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
-o /lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar
+wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -O /lib/spark-3.5.4-bin-hadoop3/jars/hadoop-aws-3.3.4.jar
+wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar -O /lib/spark-3.5.4-bin-hadoop3/jars/aws-java-sdk-bundle-1.12.262.jar
```
-#### Note: These AWS jar versions below are specific to Spark 3.2.0
+#### Note: The AWS jar versions below are specific to Spark 3.5.4 and Hadoop 3.3.4
```
-export
CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
+export
CLIENT_JAR=/lib/spark-3.5.4-bin-hadoop3/jars/aws-java-sdk-bundle-1.12.262.jar:/lib/spark-3.5.4-bin-hadoop3/jars/hadoop-aws-3.3.4.jar
```
Once these are set, you are good to launch hudi-cli and access S3 dataset.
```
-./hudi-cli/hudi-cli.sh
+./packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
```
### Using hudi-cli on Google Dataproc
[Dataproc](https://cloud.google.com/dataproc) is Google's managed service for
running Apache Hadoop, Apache Spark,
Apache Flink, Presto and many other frameworks, including Hudi. If you want to
run the Hudi CLI on a Dataproc node
which has not been launched with Hudi support enabled, you can use the steps
below:
-These steps use Hudi version 0.13.0. If you want to use a different version
you will have to edit the below commands
+These steps use Hudi version 1.1.0. If you want to use a different version you
will have to edit the below commands
appropriately:
1. Once you've started the Dataproc cluster, you can ssh into it as follows:
```
@@ -89,22 +94,22 @@ $ gcloud compute ssh --zone "YOUR_ZONE"
"HOSTNAME_OF_MASTER_NODE" --project "YO
2. Download the Hudi CLI bundle
```
-wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/0.13.0/hudi-cli-bundle_2.12-0.13.0.jar
+wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/1.1.0/hudi-cli-bundle_2.12-1.1.0.jar
```
3. Download the Hudi Spark bundle
```
-wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.12/0.13.0/hudi-spark-bundle_2.12-0.13.0.jar
+wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.1.0/hudi-spark3.5-bundle_2.12-1.1.0.jar
```
4. Download the shell script that launches Hudi CLI bundle
```
-wget
https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
+wget
https://raw.githubusercontent.com/apache/hudi/release-1.1.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
```
5. Launch Hudi CLI bundle with appropriate environment variables as follows:
```
-CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar
CLI_BUNDLE_JAR=hudi-cli-bundle_2.12-0.13.0.jar
SPARK_BUNDLE_JAR=hudi-spark-bundle_2.12-0.13.0.jar ./hudi-cli-with-bundle.sh
+CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar
CLI_BUNDLE_JAR=hudi-cli-bundle_2.12-1.1.0.jar
SPARK_BUNDLE_JAR=hudi-spark3.5-bundle_2.12-1.1.0.jar ./hudi-cli-with-bundle.sh
```
6. hudi->connect --path gs://path_to_some_table
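The S3 environment setup in the hunks above can be sketched as a single shell snippet. This is a hedged sketch only: the Spark install path, jar versions, and directory layout below are assumptions taken from the examples in the diff and should be adjusted to your environment.

```shell
# Sketch of the S3-access setup described in the updated cli.md.
# All paths and versions here are assumptions; adjust to your install.
SPARK_HOME=/lib/spark-3.5.4-bin-hadoop3   # assumed local Spark install
export SPARK_HOME
export AWS_REGION=us-east-2               # placeholder region from the doc

# Build the CLIENT_JAR classpath from the two AWS jars the doc downloads
# into Spark's jars directory (aws-java-sdk-bundle and hadoop-aws).
AWS_SDK_JAR="$SPARK_HOME/jars/aws-java-sdk-bundle-1.12.262.jar"
HADOOP_AWS_JAR="$SPARK_HOME/jars/hadoop-aws-3.3.4.jar"
CLIENT_JAR="$AWS_SDK_JAR:$HADOOP_AWS_JAR"
export CLIENT_JAR

echo "$CLIENT_JAR"
```

With `CLI_BUNDLE_JAR` and `SPARK_BUNDLE_JAR` also exported, the CLI would then be launched via `./packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh` as the diff describes.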
diff --git a/website/versioned_docs/version-1.0.0/cli.md
b/website/versioned_docs/version-1.0.0/cli.md
index def32b11a8e..b25b30b0e8e 100644
--- a/website/versioned_docs/version-1.0.0/cli.md
+++ b/website/versioned_docs/version-1.0.0/cli.md
@@ -11,7 +11,7 @@ Once hudi has been built, the shell can be fired by via `cd
hudi-cli && ./hudi-
In release `0.13.0` we have now added another way of launching the `hudi cli`,
which is using the `hudi-cli-bundle`.
There are a couple of requirements when using this approach such as having
`spark` installed locally on your machine.
-It is required to use a spark distribution with hadoop dependencies packaged
such as `spark-3.3.1-bin-hadoop2.tgz` from
https://archive.apache.org/dist/spark/.
+It is required to use a spark distribution with hadoop dependencies packaged
such as `spark-3.5.3-bin-hadoop3.tgz` from
https://archive.apache.org/dist/spark/.
We also recommend you set an env variable `$SPARK_HOME` to the path of where
spark is installed on your machine.
One important thing to note is that the `hudi-spark-bundle` should also be
present when using the `hudi-cli-bundle`.
To provide the locations of these bundle jars you can set them in your shell
like so:
@@ -48,7 +48,7 @@ we would need this location in order to connect to a Hudi
table. Hudi library ef
If you are using hudi that comes packaged with AWS EMR, you can find
instructions to use hudi-cli
[here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-cli.html).
If you are not using EMR, or would like to use latest hudi-cli from master,
you can follow the below steps to access S3 dataset in your local environment
(laptop).
-Build Hudi with corresponding Spark version, for eg, -Dspark3.1.x
+Build Hudi with the corresponding Spark version, e.g., `-Dspark3.5`.
Set the following environment variables.
```
@@ -63,13 +63,13 @@ Apart from these, we might need to add aws jars to class
path so that accessing
We need two jars, namely, aws-java-sdk-bundle jar and hadoop-aws jar which you
can find online.
For eg:
```
-wget
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
-o /lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar
-wget
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
-o /lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar
+wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -O /lib/spark-3.5.3-bin-hadoop3/jars/hadoop-aws-3.3.4.jar
+wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar -O /lib/spark-3.5.3-bin-hadoop3/jars/aws-java-sdk-bundle-1.12.262.jar
```
-#### Note: These AWS jar versions below are specific to Spark 3.2.0
+#### Note: These AWS jar versions below are specific to Spark 3.5.3 and Hadoop
3.3.4
```
-export
CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
+export
CLIENT_JAR=/lib/spark-3.5.3-bin-hadoop3/jars/aws-java-sdk-bundle-1.12.262.jar:/lib/spark-3.5.3-bin-hadoop3/jars/hadoop-aws-3.3.4.jar
```
Once these are set, you are good to launch hudi-cli and access S3 dataset.
```
@@ -80,7 +80,7 @@ Once these are set, you are good to launch hudi-cli and
access S3 dataset.
Apache Flink, Presto and many other frameworks, including Hudi. If you want to
run the Hudi CLI on a Dataproc node
which has not been launched with Hudi support enabled, you can use the steps
below:
-These steps use Hudi version 0.13.0. If you want to use a different version
you will have to edit the below commands
+These steps use Hudi version 1.0.0. If you want to use a different version you
will have to edit the below commands
appropriately:
1. Once you've started the Dataproc cluster, you can ssh into it as follows:
```
@@ -89,22 +89,22 @@ $ gcloud compute ssh --zone "YOUR_ZONE"
"HOSTNAME_OF_MASTER_NODE" --project "YO
2. Download the Hudi CLI bundle
```
-wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/0.13.0/hudi-cli-bundle_2.12-0.13.0.jar
+wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/1.0.0/hudi-cli-bundle_2.12-1.0.0.jar
```
3. Download the Hudi Spark bundle
```
-wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.12/0.13.0/hudi-spark-bundle_2.12-0.13.0.jar
+wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.0.0/hudi-spark3.5-bundle_2.12-1.0.0.jar
```
4. Download the shell script that launches Hudi CLI bundle
```
-wget
https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
+wget
https://raw.githubusercontent.com/apache/hudi/release-1.0.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
```
5. Launch Hudi CLI bundle with appropriate environment variables as follows:
```
-CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar
CLI_BUNDLE_JAR=hudi-cli-bundle_2.12-0.13.0.jar
SPARK_BUNDLE_JAR=hudi-spark-bundle_2.12-0.13.0.jar ./hudi-cli-with-bundle.sh
+CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar
CLI_BUNDLE_JAR=hudi-cli-bundle_2.12-1.0.0.jar
SPARK_BUNDLE_JAR=hudi-spark3.5-bundle_2.12-1.0.0.jar ./hudi-cli-with-bundle.sh
```
6. hudi->connect --path gs://path_to_some_table
diff --git a/website/versioned_docs/version-1.0.1/cli.md
b/website/versioned_docs/version-1.0.1/cli.md
index fbae38e18b7..75aa01e4de0 100644
--- a/website/versioned_docs/version-1.0.1/cli.md
+++ b/website/versioned_docs/version-1.0.1/cli.md
@@ -11,7 +11,7 @@ Once hudi has been built, the shell can be fired by via `cd
hudi-cli && ./hudi-
In release `0.13.0` we have now added another way of launching the `hudi cli`,
which is using the `hudi-cli-bundle`.
There are a couple of requirements when using this approach such as having
`spark` installed locally on your machine.
-It is required to use a spark distribution with hadoop dependencies packaged
such as `spark-3.3.1-bin-hadoop2.tgz` from
https://archive.apache.org/dist/spark/.
+It is required to use a spark distribution with hadoop dependencies packaged
such as `spark-3.5.4-bin-hadoop3.tgz` from
https://archive.apache.org/dist/spark/.
We also recommend you set an env variable `$SPARK_HOME` to the path of where
spark is installed on your machine.
One important thing to note is that the `hudi-spark-bundle` should also be
present when using the `hudi-cli-bundle`.
To provide the locations of these bundle jars you can set them in your shell
like so:
@@ -48,7 +48,7 @@ we would need this location in order to connect to a Hudi
table. Hudi library ef
If you are using hudi that comes packaged with AWS EMR, you can find
instructions to use hudi-cli
[here](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-cli.html).
If you are not using EMR, or would like to use latest hudi-cli from master,
you can follow the below steps to access S3 dataset in your local environment
(laptop).
-Build Hudi with corresponding Spark version, for eg, -Dspark3.1.x
+Build Hudi with the corresponding Spark version, e.g., `-Dspark3.5`.
Set the following environment variables.
```
@@ -63,13 +63,13 @@ Apart from these, we might need to add aws jars to class
path so that accessing
We need two jars, namely, aws-java-sdk-bundle jar and hadoop-aws jar which you
can find online.
For eg:
```
-wget
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
-o /lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar
-wget
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
-o /lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar
+wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -O /lib/spark-3.5.4-bin-hadoop3/jars/hadoop-aws-3.3.4.jar
+wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar -O /lib/spark-3.5.4-bin-hadoop3/jars/aws-java-sdk-bundle-1.12.262.jar
```
-#### Note: These AWS jar versions below are specific to Spark 3.2.0
+#### Note: These AWS jar versions below are specific to Spark 3.5.4 and Hadoop
3.3.4
```
-export
CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
+export
CLIENT_JAR=/lib/spark-3.5.4-bin-hadoop3/jars/aws-java-sdk-bundle-1.12.262.jar:/lib/spark-3.5.4-bin-hadoop3/jars/hadoop-aws-3.3.4.jar
```
Once these are set, you are good to launch hudi-cli and access S3 dataset.
```
@@ -80,7 +80,7 @@ Once these are set, you are good to launch hudi-cli and
access S3 dataset.
Apache Flink, Presto and many other frameworks, including Hudi. If you want to
run the Hudi CLI on a Dataproc node
which has not been launched with Hudi support enabled, you can use the steps
below:
-These steps use Hudi version 0.13.0. If you want to use a different version
you will have to edit the below commands
+These steps use Hudi version 1.0.1. If you want to use a different version you
will have to edit the below commands
appropriately:
1. Once you've started the Dataproc cluster, you can ssh into it as follows:
```
@@ -89,22 +89,22 @@ $ gcloud compute ssh --zone "YOUR_ZONE"
"HOSTNAME_OF_MASTER_NODE" --project "YO
2. Download the Hudi CLI bundle
```
-wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/0.13.0/hudi-cli-bundle_2.12-0.13.0.jar
+wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/1.0.1/hudi-cli-bundle_2.12-1.0.1.jar
```
3. Download the Hudi Spark bundle
```
-wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.12/0.13.0/hudi-spark-bundle_2.12-0.13.0.jar
+wget
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.5-bundle_2.12/1.0.1/hudi-spark3.5-bundle_2.12-1.0.1.jar
```
4. Download the shell script that launches Hudi CLI bundle
```
-wget
https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
+wget
https://raw.githubusercontent.com/apache/hudi/release-1.0.1/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
```
5. Launch Hudi CLI bundle with appropriate environment variables as follows:
```
-CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar
CLI_BUNDLE_JAR=hudi-cli-bundle_2.12-0.13.0.jar
SPARK_BUNDLE_JAR=hudi-spark-bundle_2.12-0.13.0.jar ./hudi-cli-with-bundle.sh
+CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar
CLI_BUNDLE_JAR=hudi-cli-bundle_2.12-1.0.1.jar
SPARK_BUNDLE_JAR=hudi-spark3.5-bundle_2.12-1.0.1.jar ./hudi-cli-with-bundle.sh
```
6. hudi->connect --path gs://path_to_some_table
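The per-version Dataproc steps 2-5 above can be sketched as one parameterized script. This is a sketch only: the artifact names are derived from the Maven URLs in the diff (Scala `2.12` suffix, `spark3.5` bundle profile are assumptions for the 1.x releases) and should be verified on Maven Central before use.

```shell
# Sketch: derive the jar names and download URLs used in Dataproc
# steps 2-5, parameterized by Hudi release. Coordinates are taken
# from the URLs in the diff; verify them before relying on this.
HUDI_VERSION=1.0.1
SCALA_SUFFIX=2.12
SPARK_PROFILE=3.5
MAVEN_BASE=https://repo1.maven.org/maven2/org/apache/hudi

CLI_BUNDLE_JAR="hudi-cli-bundle_${SCALA_SUFFIX}-${HUDI_VERSION}.jar"
SPARK_BUNDLE_JAR="hudi-spark${SPARK_PROFILE}-bundle_${SCALA_SUFFIX}-${HUDI_VERSION}.jar"

echo "${MAVEN_BASE}/hudi-cli-bundle_${SCALA_SUFFIX}/${HUDI_VERSION}/${CLI_BUNDLE_JAR}"
echo "${MAVEN_BASE}/hudi-spark${SPARK_PROFILE}-bundle_${SCALA_SUFFIX}/${HUDI_VERSION}/${SPARK_BUNDLE_JAR}"

# Step 5 would then launch the CLI (not run here):
#   CLIENT_JAR=$DATAPROC_DIR/lib/gcs-connector.jar \
#   CLI_BUNDLE_JAR=$CLI_BUNDLE_JAR SPARK_BUNDLE_JAR=$SPARK_BUNDLE_JAR \
#   ./hudi-cli-with-bundle.sh
```

Changing `HUDI_VERSION` (e.g., to `1.1.0`) regenerates all of the version-specific names that steps 2-5 otherwise hard-code.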