This is an automated email from the ASF dual-hosted git repository.
yuqi4733 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/main by this push:
new 4ff652a4c [#5064] docs(spark): support s3 storage for spark hive connector (#5065)
4ff652a4c is described below
commit 4ff652a4cc3b6ddc67b3f1da0387ad504fbdd7b5
Author: FANNG <[email protected]>
AuthorDate: Thu Oct 10 10:19:19 2024 +0800
[#5064] docs(spark): support s3 storage for spark hive connector (#5065)
### What changes were proposed in this pull request?
add document about support s3 storage for spark hive connector
### Why are the changes needed?
Fix: #5064
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
create table and query tables on hive catalog with s3
---
docs/hive-catalog-with-s3.md | 24 +++++++++++-------------
docs/spark-connector/spark-catalog-hive.md | 9 ++++++++-
2 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/docs/hive-catalog-with-s3.md b/docs/hive-catalog-with-s3.md
index 2275fc301..0eb332eb9 100644
--- a/docs/hive-catalog-with-s3.md
+++ b/docs/hive-catalog-with-s3.md
@@ -184,22 +184,20 @@ This command shows the creation details of the database hive_schema, including i
To access S3-stored tables using Spark, you need to configure the SparkSession appropriately. Below is an example of how to set up the SparkSession with the necessary S3 configurations:
-
-
```java
SparkSession sparkSession =
SparkSession.builder()
- .master("local[1]")
- .appName("Hive Catalog integration test")
- .config("hive.metastore.uris", HIVE_METASTORE_URIS)
- .config("spark.hadoop.fs.s3a.access.key", accessKey)
- .config("spark.hadoop.fs.s3a.secret.key", secretKey)
- .config("spark.hadoop.fs.s3a.endpoint", getS3Endpoint)
- .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
- .config("spark.hadoop.fs.s3a.path.style.access", "true")
- .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
+ .config("spark.plugins", "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin")
+ .config("spark.sql.gravitino.uri", "http://localhost:8090")
+ .config("spark.sql.gravitino.metalake", "xx")
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.access.key", accessKey)
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.secret.key", secretKey)
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.endpoint", getS3Endpoint)
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.path.style.access", "true")
+ .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.connection.ssl.enabled", "false")
.config(
- "spark.hadoop.fs.s3a.aws.credentials.provider",
+ "spark.sql.catalog.{hive_catalog_name}.fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
.config("spark.sql.storeAssignmentPolicy", "LEGACY")
.config("mapreduce.input.fileinputformat.input.dir.recursive", "true")
@@ -210,7 +208,7 @@ To access S3-stored tables using Spark, you need to configure the SparkSession a
```
:::Note
-Please ensure that the necessary S3-related JAR files are included in the Spark classpath. If the JARs are missing, Spark will not be able to access the S3 storage.
+Please download the [hadoop aws jar](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) and [aws java sdk jar](https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle) and place them on the Spark classpath. If these JARs are missing, Spark will not be able to access the S3 storage.
:::
By following these instructions, you can effectively manage and access your S3-stored data through both Hive CLI and Spark, leveraging the capabilities of Gravitino for optimal data management.
\ No newline at end of file
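The note above asks for the hadoop-aws and aws-java-sdk bundle jars. One way to fetch them is directly from Maven Central; a sketch (the versions 3.3.4 and 1.12.262 are illustrative assumptions, not values from this commit, and must match your Hadoop build):

```shell
# Download the two jars the note refers to from Maven Central.
# Versions below are illustrative; pick ones compatible with your Hadoop version.
MAVEN=https://repo1.maven.org/maven2
curl -fLO "$MAVEN/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar"
curl -fLO "$MAVEN/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar"
```

The downloaded jars can then be placed in Spark's `jars/` directory or passed via `--jars`.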
diff --git a/docs/spark-connector/spark-catalog-hive.md b/docs/spark-connector/spark-catalog-hive.md
index c251de5ab..ba102c072 100644
--- a/docs/spark-connector/spark-catalog-hive.md
+++ b/docs/spark-connector/spark-catalog-hive.md
@@ -70,4 +70,11 @@ Gravitino catalog property names with the prefix `spark.bypass.` are passed to S
:::caution
When using the `spark-sql` shell client, you must explicitly set the `spark.bypass.spark.sql.hive.metastore.jars` in the Gravitino Hive catalog properties. Replace the default `builtin` value with the appropriate setting for your setup.
-:::
\ No newline at end of file
+:::
+
+
+## Storage
+
+### S3
+
+Please refer to [Hive catalog with s3](../hive-catalog-with-s3.md) to set up a Hive catalog with s3 storage. To query data stored in s3, you need to add the s3 credentials to the Spark configuration using `spark.sql.catalog.${hive_catalog_name}.fs.s3a.access.key` and `spark.sql.catalog.${hive_catalog_name}.fs.s3a.secret.key`. Additionally, download the [hadoop aws jar](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) and [aws java sdk jar](https://mvnrepository.com/artifact/com [...]
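The same per-catalog S3A settings shown in the Java example above can also be supplied when launching the `spark-sql` shell. A minimal sketch, assuming placeholder names throughout (metalake `test`, catalog `hive_catalog`, the endpoint URL, and the jar versions are all illustrative, not values from this commit):

```shell
# Launch spark-sql with the Gravitino Spark plugin and per-catalog S3A credentials.
# Every concrete name below (metalake, catalog, endpoint, jar versions) is a placeholder.
spark-sql \
  --jars hadoop-aws-3.3.4.jar,aws-java-sdk-bundle-1.12.262.jar \
  --conf spark.plugins=org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin \
  --conf spark.sql.gravitino.uri=http://localhost:8090 \
  --conf spark.sql.gravitino.metalake=test \
  --conf spark.sql.catalog.hive_catalog.fs.s3a.access.key="${AWS_ACCESS_KEY_ID}" \
  --conf spark.sql.catalog.hive_catalog.fs.s3a.secret.key="${AWS_SECRET_ACCESS_KEY}" \
  --conf spark.sql.catalog.hive_catalog.fs.s3a.endpoint=http://s3.example.com
```

Passing the credentials through environment variables, as sketched here, avoids embedding secrets in shell history or scripts.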