This is an automated email from the ASF dual-hosted git repository.
jshao pushed a commit to branch branch-0.9
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/branch-0.9 by this push:
new 01de6310f9 [#8391] fix(docs): Fix error in documents about how to use bundle jars for Azure Blob Storage and GCS (#8404)
01de6310f9 is described below
commit 01de6310f9cd47f0626283c8ca67823e78d1ab6b
Author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
AuthorDate: Wed Sep 3 13:58:04 2025 +0800
[#8391] fix(docs): Fix error in documents about how to use bundle jars for Azure Blob Storage and GCS (#8404)
### What changes were proposed in this pull request?
Add more details and fix errors about how to use PySpark to access Azure Blob Storage and GCS via GVFS.
### Why are the changes needed?
For better user experience.
Fix: #8391
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Test locally.
Co-authored-by: Mini Yu <[email protected]>
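As context for the change this commit makes to the two docs pages, the JDK8 and JDK17 `PYSPARK_SUBMIT_ARGS` variants differ only in the extra JVM options. A rough sketch, not part of the commit, with placeholder jar paths taken from the docs:

```python
import os

# Placeholder paths; {gravitino-version} is left unresolved, as in the docs.
jars = ",".join([
    "/path/to/gravitino-azure-{gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar",
])

# JDK8: the jars alone are enough.
jdk8_args = f"--jars {jars} --master local[1] pyspark-shell"

# JDK17: the module system blocks the reflective access Hadoop's IO code
# needs, so --add-opens must be passed to both driver and executor JVMs.
add_opens = "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
jdk17_args = (
    f"--jars {jars} "
    f'--conf "spark.driver.extraJavaOptions={add_opens}" '
    f'--conf "spark.executor.extraJavaOptions={add_opens}" '
    "--master local[1] pyspark-shell"
)

os.environ["PYSPARK_SUBMIT_ARGS"] = jdk17_args
```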
---
docs/hadoop-catalog-with-adls.md | 27 +++++++++++++++++++++------
docs/hadoop-catalog-with-gcs.md | 22 ++++++++++++++++++++--
2 files changed, 41 insertions(+), 8 deletions(-)
diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md
index c75e85be01..81474a937e 100644
--- a/docs/hadoop-catalog-with-adls.md
+++ b/docs/hadoop-catalog-with-adls.md
@@ -308,7 +308,7 @@ Or use the bundle jar with Hadoop environment if there is no Hadoop environment:
### Using Spark to access the fileset
-The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
+The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** and JDK8 to access the fileset:
Before running the following code, you need to install required packages:
@@ -328,8 +328,10 @@ metalake_name = "test"
catalog_name = "your_adls_catalog"
schema_name = "your_adls_schema"
fileset_name = "your_adls_fileset"
-
+# JDK8
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
+# JDK17
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell"
spark = SparkSession.builder
.appName("adls_fileset_test")
.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
@@ -337,7 +339,7 @@ spark = SparkSession.builder
.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
.config("spark.hadoop.fs.gravitino.client.metalake", "test")
.config("spark.hadoop.azure-storage-account-name", "azure_account_name")
- .config("spark.hadoop.azure-storage-account-key", "azure_account_name")
+ .config("spark.hadoop.azure-storage-account-key", "azure_account_key")
.config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true")
.config("spark.driver.memory", "2g")
.config("spark.driver.port", "2048")
@@ -361,11 +363,24 @@ If your Spark **without Hadoop environment**, you can use the following code sni
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
```
+If Spark can't start with the above configuration (no Hadoop environment available and using the bundle jar), you can try adding the jars to the classpath directly:
-- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment(3.3.1) and `hadoop-azure` jar.
-- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar.
-- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
+```python
+jars_path = (
+ "/path/to/gravitino-azure-bundle-{gravitino-version}.jar:"
+ "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar"
+)
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = (
+ f'--conf "spark.driver.extraClassPath={jars_path}" '
+ f'--conf "spark.executor.extraClassPath={jars_path}" '
+ '--master local[1] pyspark-shell'
+)
+```
+- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment (3.3.1), `hadoop-azure.jar` and all packages needed to access ADLS.
+- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure.jar`.
+- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
Please choose the correct jar according to your environment.
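The line above tells readers to choose the correct jar for their environment. A hypothetical helper, not from the docs, could encode that choice; the function name and the `HADOOP_HOME` heuristic are illustrative assumptions:

```python
from typing import Optional

def pick_adls_jar(version: str, hadoop_home: Optional[str]) -> str:
    """Pick the slim jar when a Hadoop environment exists, else the bundle."""
    if hadoop_home:
        # Hadoop present: hadoop-azure-*.jar and azure-storage-*.jar can be
        # taken from ${HADOOP_HOME}/share/hadoop/tools/lib instead.
        return f"gravitino-azure-{version}.jar"
    # No Hadoop: the bundle jar ships a Hadoop environment and hadoop-azure.
    return f"gravitino-azure-bundle-{version}.jar"
```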
diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md
index 92e931cfd2..cf65e070bf 100644
--- a/docs/hadoop-catalog-with-gcs.md
+++ b/docs/hadoop-catalog-with-gcs.md
@@ -116,7 +116,7 @@ gcs_properties = gravitino_client.create_catalog(name="test_catalog",
### Step2: Create a schema
-Once you have created a Hadoop catalog with GCS, you can create a schema. The following example shows how to create a schema:
+Once you’ve created a Fileset catalog with GCS, you can create a schema. The following example shows how to create a schema:
<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">
@@ -299,7 +299,7 @@ Or use the bundle jar with Hadoop environment if there is no Hadoop environment:
### Using Spark to access the fileset
-The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
+The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** and JDK8 to access the fileset:
Before running the following code, you need to install required packages:
@@ -320,7 +320,10 @@ catalog_name = "your_gcs_catalog"
schema_name = "your_gcs_schema"
fileset_name = "your_gcs_fileset"
+# JDK8
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell"
+# JDK17
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell"
spark = SparkSession.builder
.appName("gcs_fielset_test")
.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
@@ -351,6 +354,21 @@ If your Spark **without Hadoop environment**, you can use the following code sni
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell"
```
+If Spark can't start with the above configuration (no Hadoop environment available and using the bundle jar), you can try adding the jars to the classpath directly:
+
+```python
+jars_path = (
+ "/path/to/gravitino-gcp-bundle-{gravitino-version}.jar:"
+ "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar"
+)
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = (
+ f'--conf "spark.driver.extraClassPath={jars_path}" '
+ f'--conf "spark.executor.extraClassPath={jars_path}" '
+ '--master local[1] pyspark-shell'
+)
+```
+
- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment(3.3.1) and `gcs-connector`.
- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar)
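The classpath fallback the commit adds for GCS can be sanity-checked before launching Spark; a sketch with placeholder paths (the colon separator assumes a Unix JVM, and `shlex` is only used here to verify the quoting, not by Spark itself):

```python
import os
import shlex

# ':' is the JVM classpath separator on Unix; paths are placeholders.
jars_path = ":".join([
    "/path/to/gravitino-gcp-bundle-{gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar",
])

args = (
    f'--conf "spark.driver.extraClassPath={jars_path}" '
    f'--conf "spark.executor.extraClassPath={jars_path}" '
    "--master local[1] pyspark-shell"
)
os.environ["PYSPARK_SUBMIT_ARGS"] = args

# A mis-quoted PYSPARK_SUBMIT_ARGS is a common reason the JVM fails to start;
# shlex.split should yield --conf/--master option pairs with quotes stripped.
tokens = shlex.split(args)
```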