This is an automated email from the ASF dual-hosted git repository.
jshao pushed a commit to branch branch-0.9
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/branch-0.9 by this push:
new 01de6310f9 [#8391] fix(docs): Fix error in documents about how to use bundle jars for Azure Blob Storage and GCS (#8404)
01de6310f9 is described below
commit 01de6310f9cd47f0626283c8ca67823e78d1ab6b
Author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
AuthorDate: Wed Sep 3 13:58:04 2025 +0800
[#8391] fix(docs): Fix error in documents about how to use bundle jars for Azure Blob Storage and GCS (#8404)
### What changes were proposed in this pull request?
Add more details and fix errors about how to use PySpark to access Azure Blob Storage and GCS via GVFS.
### Why are the changes needed?
For better user experience.
Fix: #8391
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Test locally.
Co-authored-by: Mini Yu <[email protected]>
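As context for the change this commit makes to the two docs pages, the JDK8 and JDK17 `PYSPARK_SUBMIT_ARGS` variants differ only in the extra JVM options. A rough sketch, not part of the commit, with placeholder jar paths taken from the docs:

```python
import os

# Placeholder paths; {gravitino-version} is left unresolved, as in the docs.
jars = ",".join([
    "/path/to/gravitino-azure-{gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar",
])

# JDK8: the jars alone are enough.
jdk8_args = f"--jars {jars} --master local[1] pyspark-shell"

# JDK17: the module system blocks the reflective access Hadoop's IO code
# needs, so --add-opens must be passed to both driver and executor JVMs.
add_opens = "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
jdk17_args = (
    f"--jars {jars} "
    f'--conf "spark.driver.extraJavaOptions={add_opens}" '
    f'--conf "spark.executor.extraJavaOptions={add_opens}" '
    "--master local[1] pyspark-shell"
)

os.environ["PYSPARK_SUBMIT_ARGS"] = jdk17_args
```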
---
docs/hadoop-catalog-with-adls.md | 27 +++++++++++++++++++++------
docs/hadoop-catalog-with-gcs.md | 22 ++++++++++++++++++++--
2 files changed, 41 insertions(+), 8 deletions(-)
diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md
index c75e85be01..81474a937e 100644
--- a/docs/hadoop-catalog-with-adls.md
+++ b/docs/hadoop-catalog-with-adls.md
@@ -308,7 +308,7 @@ Or use the bundle jar with Hadoop environment if there is no Hadoop environment:
### Using Spark to access the fileset
-The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
+The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** and JDK8 to access the fileset:
Before running the following code, you need to install required packages:
@@ -328,8 +328,10 @@ metalake_name = "test"
catalog_name = "your_adls_catalog"
schema_name = "your_adls_schema"
fileset_name = "your_adls_fileset"
-
+# JDK8
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
+# JDK17
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell"
spark = SparkSession.builder
.appName("adls_fileset_test")
.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
@@ -337,7 +339,7 @@ spark = SparkSession.builder
.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
.config("spark.hadoop.fs.gravitino.client.metalake", "test")
.config("spark.hadoop.azure-storage-account-name", "azure_account_name")
- .config("spark.hadoop.azure-storage-account-key", "azure_account_name")
+ .config("spark.hadoop.azure-storage-account-key", "azure_account_key")
.config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true")
.config("spark.driver.memory", "2g")
.config("spark.driver.port", "2048")
@@ -361,11 +363,24 @@ If your Spark **without Hadoop environment**, you can use the following code sni
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
```
+If Spark can't start with the above configuration (no Hadoop environment available and using the bundle jar), you can try adding the jars to the classpath directly:
-- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment(3.3.1) and `hadoop-azure` jar.
-- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar.
-- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
+```python
+jars_path = (
+ "/path/to/gravitino-azure-bundle-{gravitino-version}.jar:"
+ "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar"
+)
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = (
+ f'--conf "spark.driver.extraClassPath={jars_path}" '
+ f'--conf "spark.executor.extraClassPath={jars_path}" '
+ '--master local[1] pyspark-shell'
+)
+```
+- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with Hadoop environment (3.3.1), `hadoop-azure.jar` and all packages needed to access ADLS.
+- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure.jar`.
+- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
Please choose the correct jar according to your environment.
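The line above tells readers to choose the correct jar for their environment. A hypothetical helper, not from the docs, could encode that choice; the function name and the `HADOOP_HOME` heuristic are illustrative assumptions:

```python
from typing import Optional

def pick_adls_jar(version: str, hadoop_home: Optional[str]) -> str:
    """Pick the slim jar when a Hadoop environment exists, else the bundle."""
    if hadoop_home:
        # Hadoop present: hadoop-azure-*.jar and azure-storage-*.jar can be
        # taken from ${HADOOP_HOME}/share/hadoop/tools/lib instead.
        return f"gravitino-azure-{version}.jar"
    # No Hadoop: the bundle jar ships a Hadoop environment and hadoop-azure.
    return f"gravitino-azure-bundle-{version}.jar"
```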
diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md
index 92e931cfd2..cf65e070bf 100644
--- a/docs/hadoop-catalog-with-gcs.md
+++ b/docs/hadoop-catalog-with-gcs.md
@@ -116,7 +116,7 @@ gcs_properties = gravitino_client.create_catalog(name="test_catalog",
### Step2: Create a schema
-Once you have created a Hadoop catalog with GCS, you can create a schema. The following example shows how to create a schema:
+Once you’ve created a Fileset catalog with GCS, you can create a schema. The following example shows how to create a schema:
<Tabs groupId="language" queryString>
<TabItem value="shell" label="Shell">
@@ -299,7 +299,7 @@ Or use the bundle jar with Hadoop environment if there is no Hadoop environment:
### Using Spark to access the fileset
-The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
+The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment (Hadoop 3.2.0)** and JDK8 to access the fileset:
Before running the following code, you need to install required packages:
@@ -320,7 +320,10 @@ catalog_name = "your_gcs_catalog"
schema_name = "your_gcs_schema"
fileset_name = "your_gcs_fileset"
+# JDK8
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell"
+# JDK17
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell"
spark = SparkSession.builder
.appName("gcs_fielset_test")
.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
@@ -351,6 +354,21 @@ If your Spark **without Hadoop environment**, you can use the following code sni
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar, --master local[1] pyspark-shell"
```
+If Spark can't start with the above configuration (no Hadoop environment available and using the bundle jar), you can try adding the jars to the classpath directly:
+
+```python
+jars_path = (
+ "/path/to/gravitino-gcp-bundle-{gravitino-version}.jar:"
+ "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar"
+)
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = (
+ f'--conf "spark.driver.extraClassPath={jars_path}" '
+ f'--conf "spark.executor.extraClassPath={jars_path}" '
+ '--master local[1] pyspark-shell'
+)
+```
+
- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with Hadoop environment(3.3.1) and `gcs-connector`.
- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without Hadoop environment and [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar)
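The classpath fallback the commit adds for GCS can be sanity-checked before launching Spark; a sketch with placeholder paths (the colon separator assumes a Unix JVM, and `shlex` is only used here to verify the quoting, not by Spark itself):

```python
import os
import shlex

# ':' is the JVM classpath separator on Unix; paths are placeholders.
jars_path = ":".join([
    "/path/to/gravitino-gcp-bundle-{gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar",
])

args = (
    f'--conf "spark.driver.extraClassPath={jars_path}" '
    f'--conf "spark.executor.extraClassPath={jars_path}" '
    "--master local[1] pyspark-shell"
)
os.environ["PYSPARK_SUBMIT_ARGS"] = args

# A mis-quoted PYSPARK_SUBMIT_ARGS is a common reason the JVM fails to start;
# shlex.split should yield --conf/--master option pairs with quotes stripped.
tokens = shlex.split(args)
```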