yuqi1129 commented on code in PR #6059:
URL: https://github.com/apache/gravitino/pull/6059#discussion_r1909009522
##########
docs/hadoop-catalog-with-adls.md:
##########
@@ -0,0 +1,441 @@
+---
+title: "Hadoop catalog with ADLS"
+slug: /hadoop-catalog-with-adls
+date: 2025-01-03
+keyword: Hadoop catalog ADLS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with ADLS (Azure Data Lake Storage Gen2, which is built on Azure Blob Storage).
+
+## Prerequisites
+
+To set up a Hadoop catalog with ADLS, follow these steps:
+
+1. Download the
[`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle)
file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at
`${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
+
+```bash
+$ bin/gravitino-server.sh start
+```
+Once the server is up and running, you can proceed to configure the Hadoop
catalog with ADLS.
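
Before starting the server, it can help to confirm that the bundle jar from step 1 is actually in the catalog classpath. The following is a minimal sketch, not part of Gravitino: `find_azure_bundle` is a hypothetical helper, and it assumes only the directory layout and jar naming pattern described in the steps above.

```python
import os
from pathlib import Path
from typing import List


def find_azure_bundle(gravitino_home: str) -> List[Path]:
    """Return the Azure bundle jars found under catalogs/hadoop/libs (hypothetical helper)."""
    libs_dir = Path(gravitino_home) / "catalogs" / "hadoop" / "libs"
    if not libs_dir.is_dir():
        # Missing directory: nothing can be on the classpath.
        return []
    # The version part of the file name varies, so match it with a wildcard.
    return sorted(libs_dir.glob("gravitino-azure-bundle-*.jar"))


if __name__ == "__main__":
    # Assumption: GRAVITINO_HOME is exported in the environment.
    home = os.environ.get("GRAVITINO_HOME", ".")
    jars = find_azure_bundle(home)
    print("bundle jar found" if jars else "bundle jar missing", [j.name for j in jars])
```

An empty result suggests the jar was not copied to `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.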
+
+## Configurations for creating a Hadoop catalog with ADLS
+
+### Configuration for an ADLS Hadoop catalog
+
+In addition to the properties described in [Hadoop catalog configuration](./hadoop-catalog.md#catalog-properties), the following properties are used to configure a Hadoop catalog with ADLS:
+
+| Configuration item | Description | Default value | Required | Since version |
+|--------------------|-------------|---------------|----------|---------------|
+| `filesystem-providers` | The file system providers to add. Set it to `abs` for an Azure Blob Storage fileset, or to a comma-separated string that contains `abs`, such as `oss,abs,s3`, to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating |
+| `default-filesystem-provider` | The name of the default filesystem provider of this Hadoop catalog, used when users do not specify the scheme in the URI. The default value is `builtin-local`. For Azure Blob Storage, if this value is set to `abs`, the `abfss://` prefix can be omitted from the location. | `builtin-local` | No | 0.8.0-incubating |
+| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes, if it's an Azure Blob Storage fileset. | 0.8.0-incubating |
+| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes, if it's an Azure Blob Storage fileset. | 0.8.0-incubating |
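
The "Required" column above can be checked on the client side before a catalog is created. The sketch below is only an illustration of the table's rules: `missing_adls_properties` is a hypothetical helper, not a Gravitino API, and the server remains the authority on what is valid.

```python
from typing import Dict, List

# Properties the table marks as required for an Azure Blob Storage fileset.
ADLS_REQUIRED_KEYS = ("azure-storage-account-name", "azure-storage-account-key")


def missing_adls_properties(properties: Dict[str, str]) -> List[str]:
    """Return the names of required properties that are absent or empty (hypothetical helper)."""
    missing: List[str] = []
    raw = properties.get("filesystem-providers", "")
    # `filesystem-providers` is a comma-separated list such as "oss,abs,s3".
    providers = [p.strip() for p in raw.split(",") if p.strip()]
    if not providers:
        missing.append("filesystem-providers")
    if "abs" in providers:
        # The account name and key are only required when `abs` is enabled.
        missing.extend(k for k in ADLS_REQUIRED_KEYS if not properties.get(k))
    return missing
```

For example, `missing_adls_properties({"filesystem-providers": "abs"})` reports both account properties as missing, while a property map with only `"filesystem-providers": "s3"` passes this particular check.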
+
+### Configurations for a schema
+
+Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for
more details.
+
+### Configurations for a fileset
+
+Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for
more details.
+
+## Example of creating Hadoop catalog with ADLS
+
+This section demonstrates how to create the Hadoop catalog with ADLS in
Gravitino, with a complete example.
+
+### Step1: Create a Hadoop catalog with ADLS
+
+First, you need to create a Hadoop catalog with ADLS. The following example
shows how to create a Hadoop catalog with ADLS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "abfss://[email protected]/path",
+ "azure-storage-account-name": "The account name of the Azure Blob Storage",
+ "azure-storage-account-key": "The account key of the Azure Blob Storage",
+ "filesystem-providers": "abs"
+ }
+}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("${GRAVITINO_SERVER_IP:PORT}")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> adlsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "abfss://[email protected]/path")
+ .put("azure-storage-account-name", "azure storage account name")
+ .put("azure-storage-account-key", "azure storage account key")
+ .put("filesystem-providers", "abs")
+ .build();
+
+Catalog adlsCatalog = gravitinoClient.createCatalog("test_catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is an ADLS fileset catalog",
+ adlsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake")
+adls_properties = {
+ "location": "abfss://[email protected]/path",
+ "azure-storage-account-name": "azure storage account name",
+ "azure-storage-account-key": "azure storage account key",
+ "filesystem-providers": "abs"
+}
+
+adls_catalog = gravitino_client.create_catalog(name="test_catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is an ADLS fileset catalog",
+ properties=adls_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Step2: Create a schema
+
+Once the catalog is created, you can create a schema. The following example
shows how to create a schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_schema",
+ "comment": "comment",
+ "properties": {
+ "location": "abfss://[email protected]/path"
+ }
+}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+ .put("location", "abfss://[email protected]/path")
+ .build();
+Schema schema = supportsSchemas.createSchema("test_schema",
+ "This is a schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_schemas().create_schema(name="test_schema",
+ comment="This is a schema",
+ properties={"location": "abfss://[email protected]/path"})
+```
+
+</TabItem>
+</Tabs>
+
+### Step3: Create a fileset
+
+After creating the schema, you can create a fileset. The following example
shows how to create a fileset:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": "abfss://[email protected]/path/example_fileset",
+ "properties": {
+ "k1": "v1"
+ }
+}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("${GRAVITINO_SERVER_IP:PORT}")
+ .withMetalake("metalake")
+ .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+ .put("k1", "v1")
+ .build();
+
+filesetCatalog.createFileset(
+ NameIdentifier.of("test_schema", "example_fileset"),
+ "This is an example fileset",
+ Fileset.Type.MANAGED,
+ "abfss://[email protected]/path/example_fileset",
+ propertiesMap
+);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"),
+ type=Fileset.Type.MANAGED,
+ comment="This is an example fileset",
+ storage_location="abfss://[email protected]/path/example_fileset",
+ properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
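
The three shell examples above post to nested REST paths: catalogs under a metalake, schemas under a catalog, and filesets under a schema. As a rough sketch of that nesting, the helpers below rebuild the same paths from their parts. The function names are hypothetical and not part of any Gravitino client; the path layout is taken only from the `curl` commands shown in Steps 1 to 3.

```python
from urllib.parse import quote


def catalogs_url(server: str, metalake: str) -> str:
    """Endpoint used in Step 1 to create a catalog."""
    return f"{server.rstrip('/')}/api/metalakes/{quote(metalake, safe='')}/catalogs"


def schemas_url(server: str, metalake: str, catalog: str) -> str:
    """Endpoint used in Step 2 to create a schema in a catalog."""
    return f"{catalogs_url(server, metalake)}/{quote(catalog, safe='')}/schemas"


def filesets_url(server: str, metalake: str, catalog: str, schema: str) -> str:
    """Endpoint used in Step 3 to create a fileset in a schema."""
    return f"{schemas_url(server, metalake, catalog)}/{quote(schema, safe='')}/filesets"
```

Keeping the catalog name identical across all three calls matters here, because the schema and fileset paths embed it.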
+
+## Accessing a fileset with ADLS
+
+### Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+```python
+import logging
+from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient
+from pyspark.sql import SparkSession
Review Comment:
ok,