This is an automated email from the ASF dual-hosted git repository.

jshao pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gravitino.git


The following commit(s) were added to refs/heads/main by this push:
     new e25a67aac [#5557] improvement(CI): Add some docs and tests about how 
to use Azure Blob Storage(ADLS) in Hive (#5558)
e25a67aac is described below

commit e25a67aac6669efbff77c83b5bb7e143117a04a2
Author: Qi Yu <[email protected]>
AuthorDate: Fri Nov 29 10:02:52 2024 +0800

    [#5557] improvement(CI): Add some docs and tests about how to use Azure 
Blob Storage(ADLS) in Hive (#5558)
    
    ### What changes were proposed in this pull request?
    
    Add some tests to demonstrate how to use ADLS in Hive.
    
    ### Why are the changes needed?
    
    To verify if we can use ADLS in Hive.
    Fix: #5557
    
    ### Does this PR introduce _any_ user-facing change?
    
    N/A
    
    ### How was this patch tested?
    
    Test manually.
---
 build.gradle.kts                                   |   2 +-
 catalogs/catalog-hive/build.gradle.kts             |   9 ++
 .../hive/integration/test/CatalogHiveABSIT.java    | 124 +++++++++++++++++++++
 dev/docker/hive/hive-dependency.sh                 |   2 +-
 dev/docker/hive/hive-site.xml                      |  10 ++
 dev/docker/hive/start.sh                           |  20 +++-
 docs/docker-image-details.md                       |   2 +
 ...with-s3.md => hive-catalog-with-s3-and-adls.md} |  54 +++++++--
 8 files changed, 211 insertions(+), 12 deletions(-)

diff --git a/build.gradle.kts b/build.gradle.kts
index 3685fc18b..65187e298 100644
--- a/build.gradle.kts
+++ b/build.gradle.kts
@@ -174,7 +174,7 @@ allprojects {
       param.environment("PROJECT_VERSION", project.version)
 
       // Gravitino CI Docker image
-      param.environment("GRAVITINO_CI_HIVE_DOCKER_IMAGE", 
"apache/gravitino-ci:hive-0.1.14")
+      param.environment("GRAVITINO_CI_HIVE_DOCKER_IMAGE", 
"apache/gravitino-ci:hive-0.1.15")
       param.environment("GRAVITINO_CI_KERBEROS_HIVE_DOCKER_IMAGE", 
"apache/gravitino-ci:kerberos-hive-0.1.5")
       param.environment("GRAVITINO_CI_DORIS_DOCKER_IMAGE", 
"apache/gravitino-ci:doris-0.1.5")
       param.environment("GRAVITINO_CI_TRINO_DOCKER_IMAGE", 
"apache/gravitino-ci:trino-0.1.6")
diff --git a/catalogs/catalog-hive/build.gradle.kts 
b/catalogs/catalog-hive/build.gradle.kts
index f7d6e60c1..b328413df 100644
--- a/catalogs/catalog-hive/build.gradle.kts
+++ b/catalogs/catalog-hive/build.gradle.kts
@@ -129,6 +129,15 @@ dependencies {
   testImplementation(libs.testcontainers.mysql)
   testImplementation(libs.testcontainers.localstack)
   testImplementation(libs.hadoop2.aws)
+  testImplementation(libs.hadoop3.abs)
+
+  // You need this to run the test CatalogHiveABSIT, as it requires the hadoop3
+  // environment introduced by hadoop3.abs (the protocol `abfss` was first introduced
+  // in Hadoop 3.2.0). However, since hadoop2.common already exists in the test
+  // classpath, adding the following dependency directly would cause a conflict
+  // between hadoop2 and hadoop3, resulting in test failures. So we comment out the
+  // following line temporarily; if you want to run the test, please uncomment it.
+  // In the future, we may need to refactor the test to avoid the conflict.
+  // testImplementation(libs.hadoop3.common)
 
   testRuntimeOnly(libs.junit.jupiter.engine)
 }
diff --git 
a/catalogs/catalog-hive/src/test/java/org/apache/gravitino/catalog/hive/integration/test/CatalogHiveABSIT.java
 
b/catalogs/catalog-hive/src/test/java/org/apache/gravitino/catalog/hive/integration/test/CatalogHiveABSIT.java
new file mode 100644
index 000000000..aaf44dae5
--- /dev/null
+++ 
b/catalogs/catalog-hive/src/test/java/org/apache/gravitino/catalog/hive/integration/test/CatalogHiveABSIT.java
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.gravitino.catalog.hive.integration.test;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.gravitino.integration.test.container.HiveContainer;
+import org.apache.gravitino.integration.test.util.GravitinoITUtils;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.condition.EnabledIf;
+import org.testcontainers.shaded.com.google.common.collect.ImmutableMap;
+
+// Apart from the following environment dependencies, this test also needs
+// hadoop3-common; please refer to L135 in the file
+// `${GRAVITINO_HOME}/catalogs/catalog-hive/build.gradle.kts`, otherwise the
+// initFileSystem method in this file will fail to run due to the missing hadoop3-common.
+@EnabledIf(
+    value = "isAzureBlobStorageConfigured",
+    disabledReason = "Azure Blob Storage is not prepared.")
+public class CatalogHiveABSIT extends CatalogHiveIT {
+
+  private static final String ABS_BUCKET_NAME = 
System.getenv("ABS_CONTAINER_NAME");
+  private static final String ABS_USER_ACCOUNT_NAME = 
System.getenv("ABS_ACCOUNT_NAME");
+  private static final String ABS_USER_ACCOUNT_KEY = 
System.getenv("ABS_ACCOUNT_KEY");
+
+  @Override
+  protected void startNecessaryContainer() {
+    Map<String, String> hiveContainerEnv =
+        ImmutableMap.of(
+            "ABS_ACCOUNT_NAME",
+            ABS_USER_ACCOUNT_NAME,
+            "ABS_ACCOUNT_KEY",
+            ABS_USER_ACCOUNT_KEY,
+            HiveContainer.HIVE_RUNTIME_VERSION,
+            HiveContainer.HIVE3);
+
+    containerSuite.startHiveContainerWithS3(hiveContainerEnv);
+
+    HIVE_METASTORE_URIS =
+        String.format(
+            "thrift://%s:%d",
+            containerSuite.getHiveContainerWithS3().getContainerIpAddress(),
+            HiveContainer.HIVE_METASTORE_PORT);
+  }
+
+  @Override
+  protected void initFileSystem() throws IOException {
+    // Use Azure Blob Storage file system
+    Configuration conf = new Configuration();
+    conf.set(
+        String.format("fs.azure.account.key.%s.dfs.core.windows.net", 
ABS_USER_ACCOUNT_NAME),
+        ABS_USER_ACCOUNT_KEY);
+    conf.set("fs.abfss.impl", 
"org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem");
+
+    String path =
+        String.format("abfss://%s@%s.dfs.core.windows.net", ABS_BUCKET_NAME, 
ABS_USER_ACCOUNT_NAME);
+    fileSystem = FileSystem.get(URI.create(path), conf);
+  }
+
+  @Override
+  protected void initSparkSession() {
+    sparkSession =
+        SparkSession.builder()
+            .master("local[1]")
+            .appName("Hive Catalog integration test")
+            .config("hive.metastore.uris", HIVE_METASTORE_URIS)
+            .config(
+                "spark.sql.warehouse.dir",
+                String.format(
+                    "abfss://%s@%s.dfs.core.windows.net/%s",
+                    ABS_BUCKET_NAME,
+                    ABS_USER_ACCOUNT_NAME,
+                    GravitinoITUtils.genRandomName("CatalogFilesetIT")))
+            .config(
+                String.format(
+                    
"spark.hadoop.fs.azure.account.key.%s.dfs.core.windows.net",
+                    ABS_USER_ACCOUNT_NAME),
+                ABS_USER_ACCOUNT_KEY)
+            .config("fs.abfss.impl", 
"org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem")
+            .config("spark.sql.storeAssignmentPolicy", "LEGACY")
+            .config("mapreduce.input.fileinputformat.input.dir.recursive", 
"true")
+            .enableHiveSupport()
+            .getOrCreate();
+  }
+
+  @Override
+  protected Map<String, String> createSchemaProperties() {
+    Map<String, String> properties = new HashMap<>();
+    properties.put("key1", "val1");
+    properties.put("key2", "val2");
+    properties.put(
+        "location",
+        String.format(
+            "abfss://%s@%s.dfs.core.windows.net/test-%s",
+            ABS_BUCKET_NAME, ABS_USER_ACCOUNT_NAME, 
System.currentTimeMillis()));
+    return properties;
+  }
+
+  private static boolean isAzureBlobStorageConfigured() {
+    return StringUtils.isNotBlank(System.getenv("ABS_ACCOUNT_NAME"))
+        && StringUtils.isNotBlank(System.getenv("ABS_ACCOUNT_KEY"))
+        && StringUtils.isNotBlank(System.getenv("ABS_CONTAINER_NAME"));
+  }
+}
diff --git a/dev/docker/hive/hive-dependency.sh 
b/dev/docker/hive/hive-dependency.sh
index 5ec228003..2038dd001 100755
--- a/dev/docker/hive/hive-dependency.sh
+++ b/dev/docker/hive/hive-dependency.sh
@@ -23,7 +23,7 @@ hive_dir="$(cd "${hive_dir}">/dev/null; pwd)"
 
 # Environment variables definition
 HADOOP2_VERSION="2.7.3"
-HADOOP3_VERSION="3.1.0"
+HADOOP3_VERSION="3.3.0"
 
 HIVE2_VERSION="2.3.9"
 HIVE3_VERSION="3.1.3"
diff --git a/dev/docker/hive/hive-site.xml b/dev/docker/hive/hive-site.xml
index 477187153..c6a247e1a 100644
--- a/dev/docker/hive/hive-site.xml
+++ b/dev/docker/hive/hive-site.xml
@@ -63,4 +63,14 @@
     
<value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,com.amazonaws.auth.EnvironmentVariableCredentialsProvider,org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
   </property>
 
+  <property>
+    <name>fs.abfss.impl</name>
+    <value>org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem</value>
+  </property>
+
+  <property>
+    <name>fs.azure.account.key.ABS_ACCOUNT_NAME.dfs.core.windows.net</name>
+    <value>ABS_ACCOUNT_KEY</value>
+  </property>
+
 </configuration>
diff --git a/dev/docker/hive/start.sh b/dev/docker/hive/start.sh
index b9c545a0a..86ced4097 100644
--- a/dev/docker/hive/start.sh
+++ b/dev/docker/hive/start.sh
@@ -22,12 +22,17 @@
 if [[ "${HIVE_RUNTIME_VERSION}" == "hive3" ]]; then
   ln -s ${HIVE3_HOME} ${HIVE_HOME}
   ln -s ${HADOOP3_HOME} ${HADOOP_HOME}
+
+  # Remove guava jar from Hive lib directory and copy from Hadoop as Hive 3.x 
is not compatible with guava 19
+  rm -rf ${HIVE_HOME}/lib/guava*.jar
+  cp ${HADOOP_HOME}/share/hadoop/common/lib/guava*.jar ${HIVE_HOME}/lib
 else
   ln -s ${HIVE2_HOME} ${HIVE_HOME}
   ln -s ${HADOOP2_HOME} ${HADOOP_HOME}
 fi
 
  cp ${HADOOP_HOME}/share/hadoop/tools/lib/*aws* ${HIVE_HOME}/lib
+ cp ${HADOOP_HOME}/share/hadoop/tools/lib/*azure* ${HIVE_HOME}/lib
 
 # Copy Hadoop and Hive configuration file and update hostname
 cp -f ${HADOOP_TMP_CONF_DIR}/* ${HADOOP_CONF_DIR}
@@ -36,9 +41,18 @@ sed -i "s/__REPLACE__HOST_NAME/$(hostname)/g" 
${HADOOP_CONF_DIR}/core-site.xml
 sed -i "s/__REPLACE__HOST_NAME/$(hostname)/g" ${HADOOP_CONF_DIR}/hdfs-site.xml
 sed -i "s/__REPLACE__HOST_NAME/$(hostname)/g" ${HIVE_CONF_DIR}/hive-site.xml
 
-sed -i "s|S3_ACCESS_KEY_ID|${S3_ACCESS_KEY}|g" ${HIVE_CONF_DIR}/hive-site.xml
-sed -i "s|S3_SECRET_KEY_ID|${S3_SECRET_KEY}|g" ${HIVE_CONF_DIR}/hive-site.xml
-sed -i "s|S3_ENDPOINT_ID|${S3_ENDPOINT}|g" ${HIVE_CONF_DIR}/hive-site.xml
+# whether S3 is set
+if [[ -n "${S3_ACCESS_KEY}" && -n "${S3_SECRET_KEY}" && -n "${S3_ENDPOINT}" 
]]; then
+  sed -i "s|S3_ACCESS_KEY_ID|${S3_ACCESS_KEY}|g" ${HIVE_CONF_DIR}/hive-site.xml
+  sed -i "s|S3_SECRET_KEY_ID|${S3_SECRET_KEY}|g" ${HIVE_CONF_DIR}/hive-site.xml
+  sed -i "s|S3_ENDPOINT_ID|${S3_ENDPOINT}|g" ${HIVE_CONF_DIR}/hive-site.xml
+fi
+
+# whether ADLS is set
+if [[ -n "${ABS_ACCOUNT_NAME}" && -n "${ABS_ACCOUNT_KEY}" ]]; then
+  sed -i "s|ABS_ACCOUNT_NAME|${ABS_ACCOUNT_NAME}|g" 
${HIVE_CONF_DIR}/hive-site.xml
+  sed -i "s|ABS_ACCOUNT_KEY|${ABS_ACCOUNT_KEY}|g" 
${HIVE_CONF_DIR}/hive-site.xml
+fi
 
 # Link mysql-connector-java after deciding where HIVE_HOME symbolic link 
points to.
 ln -s 
/opt/mysql-connector-java-${MYSQL_JDBC_DRIVER_VERSION}/mysql-connector-java-${MYSQL_JDBC_DRIVER_VERSION}.jar
 ${HIVE_HOME}/lib
diff --git a/docs/docker-image-details.md b/docs/docker-image-details.md
index 04234d2e3..fed00d83c 100644
--- a/docs/docker-image-details.md
+++ b/docs/docker-image-details.md
@@ -168,6 +168,8 @@ Changelog
 You can use this kind of image to test the catalog of Apache Hive.
 
 Changelog
+- apache/gravitino-ci:hive-0.1.15
+  - Add ADLS related configurations in the `hive-site.xml` file.
 
 - apache/gravitino-ci:hive-0.1.14 
   - Add amazon S3 related configurations in the `hive-site.xml` file.
diff --git a/docs/hive-catalog-with-s3.md 
b/docs/hive-catalog-with-s3-and-adls.md
similarity index 78%
rename from docs/hive-catalog-with-s3.md
rename to docs/hive-catalog-with-s3-and-adls.md
index 0eb332eb9..41b8eef77 100644
--- a/docs/hive-catalog-with-s3.md
+++ b/docs/hive-catalog-with-s3-and-adls.md
@@ -1,8 +1,8 @@
 ---
-title: "Hive catalog with s3"
+title: "Hive catalog with S3 and ADLS"
 slug: /hive-catalog
 date: 2024-9-24
-keyword: Hive catalog cloud storage S3
+keyword: Hive catalog cloud storage S3 ADLS
 license: "This software is licensed under the Apache License version 2."
 ---
 
@@ -11,11 +11,14 @@ license: "This software is licensed under the Apache 
License version 2."
 
 Since Hive 2.x, Hive has supported S3 as a storage backend, enabling users to 
store and manage data in Amazon S3 directly through Hive. Gravitino enhances 
this capability by supporting the Hive catalog with S3, allowing users to 
efficiently manage the storage locations of files located in S3. This 
integration simplifies data operations and enables seamless access to S3 data 
from Hive queries.
 
-The following sections will guide you through the necessary steps to configure 
the Hive catalog to utilize S3 as a storage backend, including configuration 
details and examples for creating databases and tables.
+For ADLS (a.k.a. Azure Blob Storage (ABS), or Azure Data Lake Storage (v2)), the
integration is similar to S3. The only difference is the configuration
properties for ADLS (see below).
+
+The following sections will guide you through the necessary steps to configure 
the Hive catalog to utilize S3 and ADLS as a storage backend, including 
configuration details and examples for creating databases and tables.
 
 ## Hive metastore configuration
 
-To use the Hive catalog with S3, you must configure your Hive metastore to 
recognize S3 as a storage backend. The following example illustrates the 
required changes in the `hive-site.xml` configuration file:
+
+The following will mainly focus on configuring the Hive metastore to use S3 as 
a storage backend. The same configuration can be applied to ADLS with minor 
changes in the configuration properties. 
 
 ### Example Configuration Changes
 
@@ -41,11 +44,26 @@ Below are the essential properties to add or modify in the 
`hive-site.xml` file
 <!-- The following property is optional and can be replaced with the location 
property in the schema
 definition and table definition, as shown in the examples below. After 
explicitly setting this
 property, you can omit the location property in the schema and table 
definitions.
+
+It's also applicable for ADLS.
 -->
 <property>
-   <name>hive.metastore.warehouse.dir</name>
-   <value>S3_BUCKET_PATH</value>
+  <name>hive.metastore.warehouse.dir</name>
+  <value>S3_BUCKET_PATH</value>
+</property>
+
+
+<!-- The following are for Azure Blob Storage(ADLS) -->
+<property>
+  <name>fs.abfss.impl</name>
+  <value>org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem</value>
+</property>
+
+<property>
+  <name>fs.azure.account.key.ABS_ACCOUNT_NAME.dfs.core.windows.net</name>
+  <value>ABS_ACCOUNT_KEY</value>
 </property>
+
 ```
 
 ### Adding Required JARs
@@ -53,9 +71,14 @@ property, you can omit the location property in the schema 
and table definitions
 After updating the `hive-site.xml`, you need to ensure that the necessary 
S3-related JARs are included in the Hive classpath. You can do this by 
executing the following command:
 ```shell
 cp ${HADOOP_HOME}/share/hadoop/tools/lib/*aws* ${HIVE_HOME}/lib
+
+# For Azure Blob Storage(ADLS)
+cp ${HADOOP_HOME}/share/hadoop/tools/lib/*azure* ${HIVE_HOME}/lib
 ```
+
 Alternatively, you can download the required JARs from the Maven repository 
and place them in the Hive classpath. It is crucial to verify that the JARs are 
compatible with the version of Hadoop you are using to avoid any compatibility 
issue.
 
+
 ### Restart Hive metastore
 
 Once all configurations have been correctly set, restart the Hive cluster to 
apply the changes. This step is essential to ensure that the new configurations 
take effect and that the Hive services can communicate with S3.
@@ -79,6 +102,9 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
   "comment": "comment",
   "properties": {
     "location": "s3a://bucket-name/path"
+     
+     # The following line is for Azure Blob Storage(ADLS)
+     # "location": 
"abfss://[email protected]/path"
   }
 }' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
 ```
@@ -99,6 +125,10 @@ SupportsSchemas supportsSchemas = catalog.asSchemas();
 
 Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
     .put("location", "s3a://bucket-name/path")
+    
+    // The following line is for Azure Blob Storage(ADLS)
+    // .put("location", 
"abfss://[email protected]/path")
+    
     .build();
 Schema schema = supportsSchemas.createSchema("hive_schema",
     "This is a schema",
@@ -194,6 +224,15 @@ To access S3-stored tables using Spark, you need to 
configure the SparkSession a
             .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.secret.key", 
secretKey)
             .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.endpoint", 
getS3Endpoint)
             .config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.impl", 
"org.apache.hadoop.fs.s3a.S3AFileSystem")
+
+            // These two configurations are for Azure Blob Storage (ADLS) only
+            .config(
+                String.format(
+                    
"spark.sql.catalog.{hive_catalog_name}.fs.azure.account.key.%s.dfs.core.windows.net",
+                    ABS_USER_ACCOUNT_NAME),
+                ABS_USER_ACCOUNT_KEY)
+            .config("spark.sql.catalog.{hive_catalog_name}.fs.abfss.impl", 
"org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem")
+            
             
.config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.path.style.access", 
"true")
             
.config("spark.sql.catalog.{hive_catalog_name}.fs.s3a.connection.ssl.enabled", 
"false")
             .config(
@@ -208,7 +247,8 @@ To access S3-stored tables using Spark, you need to 
configure the SparkSession a
 ```
 
 :::note
-Please download [hadoop aws 
jar](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws), [aws 
java sdk 
jar](https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle) and 
place them in the classpath of the Spark. If the JARs are missing, Spark will 
not be able to access the S3 storage.
+Please download [Hadoop AWS 
jar](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws), [aws 
java sdk 
jar](https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle) and 
place them in the classpath of the Spark. If the JARs are missing, Spark will 
not be able to access the S3 storage.
+Azure Blob Storage(ADLS) requires the [Hadoop Azure 
jar](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure), [Azure 
cloud sdk jar](https://mvnrepository.com/artifact/com.azure/azure-storage-blob) 
to be placed in the classpath of the Spark.
 :::
 
 By following these instructions, you can effectively manage and access your 
S3-stored data through both Hive CLI and Spark, leveraging the capabilities of 
Gravitino for optimal data management.
\ No newline at end of file

Reply via email to