This is an automated email from the ASF dual-hosted git repository.
yuqi4733 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/main by this push:
new 4ff652a4c [#5064] docs(spark): support s3 storage for spark hive connector (#5065)
4ff652a4c is described below
commit 4ff652a4cc3b6ddc67b3f1da0387ad504fbdd7b5
Author: FANNG <[email protected]>
AuthorDate: Thu Oct 10 10:19:19 2024 +0800
[#5064] docs(spark): support s3 storage for spark hive connector (#5065)
### What changes were proposed in this pull request?
add document about support s3 storage for spark hive connector
### Why are the changes needed?
Fix: #5064
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
create table and query tables on hive catalog with s3
---
docs/hive-catalog-with-s3.md | 24 +++++++++++-------------
docs/spark-connector/spark-catalog-hive.md | 9 ++++++++-
2 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/docs/hive-catalog-with-s3.md b/docs/hive-catalog-with-s3.md
index 2275fc301..0eb332eb9 100644
--- a/docs/hive-catalog-with-s3.md
+++ b/docs/hive-catalog-with-s3.md
@@ -184,22 +184,20 @@ This command shows the creation details of the database hive_schema, including i
To access S3-stored tables using Spark, you need to configure the SparkSession appropriately. Below is an example of how to set up the SparkSession with the necessary S3 configurations:
-
-
```java
SparkSession sparkSession =
SparkSession.builder()
- .master("local[1]")
- .appName("Hive Catalog integration test")
- .config("hive.metastore.uris", HIVE_METASTORE_URIS)
- .config("spark.hadoop.fs.s3a.access.key", accessKey)
- .config("spark.hadoop.fs.s3a.secret.key", secretKey)
- .config("spark.hadoop.fs.s3a.endpoint", getS3Endpoint)
- .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
- .config("spark.hadoop.fs.s3a.path.style.access", "true")
- .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
+ .config("spark.plugins", "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin")
+ .config("spark.sql.gravitino.uri", "http://localhost:8090")
+ .config("spark.sql.gravitino.metalake", "xx")
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.access.key", accessKey)
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.secret.key", secretKey)
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.endpoint", getS3Endpoint)
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.path.style.access", "true")
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.connection.ssl.enabled", "false")
.config(
- "spark.hadoop.fs.s3a.aws.credentials.provider",
+ "spark.sql.catalog.{hive_catalog_name}.fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
.config("spark.sql.storeAssignmentPolicy", "LEGACY")
.config("mapreduce.input.fileinputformat.input.dir.recursive", "true")
@@ -210,7 +208,7 @@ To access S3-stored tables using Spark, you need to configure the SparkSession a
```
:::Note
-Please ensure that the necessary S3-related JAR files are included in the Spark classpath. If the JARs are missing, Spark will not be able to access the S3 storage.
+Please download the [hadoop aws jar](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) and [aws java sdk jar](https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle) and place them on the Spark classpath. If these JARs are missing, Spark will not be able to access the S3 storage.
:::
By following these instructions, you can effectively manage and access your S3-stored data through both Hive CLI and Spark, leveraging the capabilities of Gravitino for optimal data management.
\ No newline at end of file
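The note above asks for the hadoop-aws and aws-java-sdk bundle jars. One way to fetch them is directly from Maven Central; a sketch (the versions 3.3.4 and 1.12.262 are illustrative assumptions, not values from this commit, and must match your Hadoop build):

```shell
# Download the two jars the note refers to from Maven Central.
# Versions below are illustrative; pick ones compatible with your Hadoop version.
MAVEN=https://repo1.maven.org/maven2
curl -fLO "$MAVEN/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar"
curl -fLO "$MAVEN/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar"
```

The downloaded jars can then be placed in Spark's `jars/` directory or passed via `--jars`.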
diff --git a/docs/spark-connector/spark-catalog-hive.md b/docs/spark-connector/spark-catalog-hive.md
index c251de5ab..ba102c072 100644
--- a/docs/spark-connector/spark-catalog-hive.md
+++ b/docs/spark-connector/spark-catalog-hive.md
@@ -70,4 +70,11 @@ Gravitino catalog property names with the prefix `spark.bypass.` are passed to S
:::caution
When using the `spark-sql` shell client, you must explicitly set the `spark.bypass.spark.sql.hive.metastore.jars` in the Gravitino Hive catalog properties. Replace the default `builtin` value with the appropriate setting for your setup.
-:::
\ No newline at end of file
+:::
+
+
+## Storage
+
+### S3
+
+Please refer to [Hive catalog with s3](../hive-catalog-with-s3.md) to set up a Hive catalog with s3 storage. To query data stored in s3, you need to add the s3 credentials to the Spark configuration using `spark.sql.catalog.${hive_catalog_name}.fs.s3a.access.key` and `spark.sql.catalog.${hive_catalog_name}.fs.s3a.secret.key`. Additionally, download the [hadoop aws jar](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) and [aws java sdk jar](https://mvnrepository.com/artifact/com [...]
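The same per-catalog S3A settings shown in the Java example above can also be supplied when launching the `spark-sql` shell. A minimal sketch, assuming placeholder names throughout (metalake `test`, catalog `hive_catalog`, the endpoint URL, and the jar versions are all illustrative, not values from this commit):

```shell
# Launch spark-sql with the Gravitino Spark plugin and per-catalog S3A credentials.
# Every concrete name below (metalake, catalog, endpoint, jar versions) is a placeholder.
spark-sql \
  --jars hadoop-aws-3.3.4.jar,aws-java-sdk-bundle-1.12.262.jar \
  --conf spark.plugins=org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin \
  --conf spark.sql.gravitino.uri=http://localhost:8090 \
  --conf spark.sql.gravitino.metalake=test \
  --conf spark.sql.catalog.hive_catalog.fs.s3a.access.key="${AWS_ACCESS_KEY_ID}" \
  --conf spark.sql.catalog.hive_catalog.fs.s3a.secret.key="${AWS_SECRET_ACCESS_KEY}" \
  --conf spark.sql.catalog.hive_catalog.fs.s3a.endpoint=http://s3.example.com
```

Passing the credentials through environment variables, as sketched here, avoids embedding secrets in shell history or scripts.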