This is an automated email from the ASF dual-hosted git repository.
fanng pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/main by this push:
new 5caa9de4f5 [#5472] improvement(docs): Add example to use cloud storage
fileset and polish hadoop-catalog document. (#6059)
5caa9de4f5 is described below
commit 5caa9de4f54f7c2c92156c6a427082eeb28ad49b
Author: Qi Yu <[email protected]>
AuthorDate: Tue Jan 14 18:45:56 2025 +0800
[#5472] improvement(docs): Add example to use cloud storage fileset and
polish hadoop-catalog document. (#6059)
### What changes were proposed in this pull request?
1. Add full example about how to use cloud storage fileset like S3, GCS,
OSS and ADLS
2. Polish how-to-use-gvfs.md and hadoop-catalog.md.
3. Add a document on how filesets use credentials.
### Why are the changes needed?
For better user experience.
Fix: #5472
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
N/A
---
.../gravitino/filesystem/gvfs_config.py | 4 +-
docs/hadoop-catalog-index.md | 26 +
docs/hadoop-catalog-with-adls.md | 522 ++++++++++++++++++++
docs/hadoop-catalog-with-gcs.md | 500 +++++++++++++++++++
docs/hadoop-catalog-with-oss.md | 538 ++++++++++++++++++++
docs/hadoop-catalog-with-s3.md | 541 +++++++++++++++++++++
docs/hadoop-catalog.md | 87 +---
docs/how-to-use-gvfs.md | 173 +------
docs/manage-fileset-metadata-using-gravitino.md | 59 +--
9 files changed, 2157 insertions(+), 293 deletions(-)
diff --git a/clients/client-python/gravitino/filesystem/gvfs_config.py
b/clients/client-python/gravitino/filesystem/gvfs_config.py
index 6fbd8a99d1..34db72adee 100644
--- a/clients/client-python/gravitino/filesystem/gvfs_config.py
+++ b/clients/client-python/gravitino/filesystem/gvfs_config.py
@@ -42,8 +42,8 @@ class GVFSConfig:
GVFS_FILESYSTEM_OSS_SECRET_KEY = "oss_secret_access_key"
GVFS_FILESYSTEM_OSS_ENDPOINT = "oss_endpoint"
- GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "abs_account_name"
- GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "abs_account_key"
+ GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "azure_storage_account_name"
+ GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "azure_storage_account_key"
# This configuration marks the expired time of the credential. For instance, if the credential
# fetched from Gravitino server has expired time of 3600 seconds, and the credential_expired_time_ration is 0.5
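The refresh rule described in the comment above can be sketched as follows. This is a hedged illustration of the assumed behavior; the helper function name is hypothetical, only the ratio semantics come from the comment:

```python
# Hypothetical helper illustrating the expiration ratio described above:
# a credential valid for `expires_in_s` seconds is treated as expired
# after expires_in_s * ratio seconds, so the client refreshes it early.
def effective_expiration(expires_in_s: float, ratio: float = 0.5) -> float:
    """Seconds after which the client refreshes the vended credential."""
    return expires_in_s * ratio

# A 3600-second credential with ratio 0.5 is refreshed after 1800 seconds.
print(effective_expiration(3600))
```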
diff --git a/docs/hadoop-catalog-index.md b/docs/hadoop-catalog-index.md
new file mode 100644
index 0000000000..dfa7a18717
--- /dev/null
+++ b/docs/hadoop-catalog-index.md
@@ -0,0 +1,26 @@
+---
+title: "Hadoop catalog index"
+slug: /hadoop-catalog-index
+date: 2025-01-13
+keyword: Hadoop catalog index S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+### Hadoop catalog overview
+
+The Gravitino Hadoop catalog documentation includes the following chapters:
+
+- [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities, and related configurations.
+- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using the Gravitino API and provides detailed examples.
+- [Using Hadoop catalog with Gravitino virtual file system](how-to-use-gvfs.md): This chapter explains how to use the Hadoop catalog with the Gravitino virtual file system and provides detailed examples.
+
+### Hadoop catalog with cloud storage
+
+Apart from the above, you can also refer to the following topics to manage and access cloud storage like S3, GCS, ADLS, and OSS:
+
+- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md).
+- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md).
+- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md).
+- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md).
+
+More storage options will be added soon. Stay tuned!
\ No newline at end of file
diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md
new file mode 100644
index 0000000000..96126c6fab
--- /dev/null
+++ b/docs/hadoop-catalog-with-adls.md
@@ -0,0 +1,522 @@
+---
+title: "Hadoop catalog with ADLS"
+slug: /hadoop-catalog-with-adls
+date: 2025-01-03
+keyword: Hadoop catalog ADLS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with ADLS (also known as Azure Blob Storage (ABS), or Azure Data Lake Storage Gen2).
+
+## Prerequisites
+
+To set up a Hadoop catalog with ADLS, follow these steps:
+
+1. Download the
[`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle)
file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at
`${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
+
+```bash
+$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start
+```
+
+Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. The rest of this document uses `http://localhost:8090` as the Gravitino server URL; please replace it with your actual server URL.
+
+## Configurations for creating a Hadoop catalog with ADLS
+
+### Configuration for a ADLS Hadoop catalog
+
+Apart from the configurations mentioned in [Hadoop catalog configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS:
+
+| Configuration item | Description
[...]
+|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| `filesystem-providers`        | The file system providers to add. Set it to `abs` if it's an Azure Blob Storage fileset, or a comma-separated string that contains `abs`, like `oss,abs,s3`, to support multiple kinds of filesets including `abs`.
[...]
+| `default-filesystem-provider` | The default filesystem provider of this Hadoop catalog, used when users do not specify a scheme in the URI. The default value is `builtin-local`; for Azure Blob Storage, setting this to `abs` lets you omit the `abfss://` prefix in locations.
[...]
+| `azure-storage-account-name`  | The account name of Azure Blob Storage.
[...]
+| `azure-storage-account-key` | The account key of Azure Blob Storage.
[...]
+| `credential-providers`        | The credential provider types, separated by commas; possible values are `adls-token` and `azure-account-key`. The default authentication type uses the account name and account key described above; setting this property enables credential vending from the Gravitino server, so clients no longer need to provide authentication information like account_name/account_key to access ADLS via GVFS. Once it's set, more configuration items are needed to make it [...]
+
+
+### Configurations for a schema
+
+Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for
more details.
+
+### Configurations for a fileset
+
+Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for
more details.
+
+## Example of creating a Hadoop catalog with ADLS
+
+This section demonstrates how to create a Hadoop catalog with ADLS in Gravitino, with a complete example.
+
+### Step 1: Create a Hadoop catalog with ADLS
+
+First, you need to create a Hadoop catalog with ADLS. The following example
shows how to create a Hadoop catalog with ADLS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "test_catalog",
+  "type": "FILESET",
+  "comment": "This is an ADLS fileset catalog",
+  "provider": "hadoop",
+  "properties": {
+    "location": "abfss://container@account-name.dfs.core.windows.net/path",
+    "azure-storage-account-name": "The account name of the Azure Blob Storage",
+    "azure-storage-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> adlsProperties = ImmutableMap.<String, String>builder()
+  .put("location", "abfss://container@account-name.dfs.core.windows.net/path")
+  .put("azure-storage-account-name", "azure storage account name")
+  .put("azure-storage-account-key", "azure storage account key")
+  .put("filesystem-providers", "abs")
+  .build();
+
+Catalog adlsCatalog = gravitinoClient.createCatalog("test_catalog",
+  Type.FILESET,
+  "hadoop", // provider, Gravitino only supports "hadoop" for now.
+  "This is an ADLS fileset catalog",
+  adlsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+adls_properties = {
+    "location": "abfss://container@account-name.dfs.core.windows.net/path",
+    "azure-storage-account-name": "azure storage account name",
+    "azure-storage-account-key": "azure storage account key",
+    "filesystem-providers": "abs"
+}
+
+adls_catalog = gravitino_client.create_catalog(name="test_catalog",
+                                               type=Catalog.Type.FILESET,
+                                               provider="hadoop",
+                                               comment="This is an ADLS fileset catalog",
+                                               properties=adls_properties)
+```
+
+</TabItem>
+</Tabs>
+
+### Step 2: Create a schema
+
+Once the catalog is created, you can create a schema. The following example
shows how to create a schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_schema",
+  "comment": "This is an ADLS schema",
+  "properties": {
+    "location": "abfss://container@account-name.dfs.core.windows.net/path"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+  .put("location", "abfss://container@account-name.dfs.core.windows.net/path")
+  .build();
+Schema schema = supportsSchemas.createSchema("test_schema",
+  "This is an ADLS schema",
+  schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_schemas().create_schema(name="test_schema",
+                                   comment="This is an ADLS schema",
+                                   properties={"location": "abfss://container@account-name.dfs.core.windows.net/path"})
+```
+
+</TabItem>
+</Tabs>
+
+### Step 3: Create a fileset
+
+After creating the schema, you can create a fileset. The following example
shows how to create a fileset:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+  "storageLocation": "abfss://container@account-name.dfs.core.windows.net/path/example_fileset",
+ "properties": {
+ "k1": "v1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+ .put("k1", "v1")
+ .build();
+
+filesetCatalog.createFileset(
+  NameIdentifier.of("test_schema", "example_fileset"),
+  "This is an example fileset",
+  Fileset.Type.MANAGED,
+  "abfss://container@account-name.dfs.core.windows.net/path/example_fileset",
+  propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example fileset",
+                                            storage_location="abfss://container@account-name.dfs.core.windows.net/path/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
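Once the fileset exists, paths under the `gvfs://` scheme map onto the fileset's storage location: `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub-path>` resolves to `<storageLocation>/<sub-path>`. The sketch below illustrates that mapping with plain string handling; the helper function and the example names are hypothetical, only the mapping rule comes from the GVFS convention:

```python
# Hypothetical sketch of how a gvfs:// path resolves to the fileset's
# storage location: the first three path segments (catalog/schema/fileset)
# are replaced by the fileset's storageLocation.
def resolve_gvfs_path(gvfs_path: str, storage_location: str) -> str:
    prefix = "gvfs://fileset/"
    assert gvfs_path.startswith(prefix)
    # Split off catalog, schema, fileset; keep the remaining sub-path.
    segments = gvfs_path[len(prefix):].split("/", 3)
    tail = segments[3] if len(segments) > 3 else ""
    return storage_location.rstrip("/") + ("/" + tail if tail else "")

storage = "abfss://container@account-name.dfs.core.windows.net/path/example_fileset"
print(resolve_gvfs_path("gvfs://fileset/test_catalog/test_schema/example_fileset/a/b.csv", storage))
```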
+
+## Accessing a fileset with ADLS
+
+### Using the GVFS Java client to access the fileset
+
+To access a fileset with Azure Blob Storage (ADLS) using the GVFS Java client, in addition to the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations:
+
+| Configuration item           | Description                             | Default value | Required | Since version    |
+|------------------------------|-----------------------------------------|---------------|----------|------------------|
+| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none)        | Yes      | 0.8.0-incubating |
+| `azure-storage-account-key`  | The account key of Azure Blob Storage.  | (none)        | Yes      | 0.8.0-incubating |
+
+:::note
+If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted. More details can be found in [Fileset with credential vending](#fileset-with-credential-vending).
+:::
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+conf.set("azure-storage-account-name", "account_name_of_adls");
+conf.set("azure-storage-account-key", "account_key_of_adls");
+Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+Similar to Spark configurations, you need to add ADLS (bundle) jars to the
classpath according to your environment.
+
+If you want to customize your Hadoop version, or there is already a Hadoop version in your project, you can add the following dependencies to your `pom.xml`:
+
+```xml
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-common</artifactId>
+ <version>${HADOOP_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-azure</artifactId>
+ <version>${HADOOP_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-azure</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+Or, if there is no Hadoop environment in your project, use the bundle jar that includes the Hadoop environment:
+
+```xml
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-azure-bundle</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+### Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+Before running the following code, you need to install the required packages:
+
+```bash
+pip install pyspark==3.1.3
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+Then you can run the following code:
+
+```python
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "your_adls_catalog"
+schema_name = "your_adls_schema"
+fileset_name = "your_adls_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("adls_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.azure-storage-account-name", "azure_account_name") \
+    .config("spark.hadoop.azure-storage-account-key", "azure_account_key") \
+    .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
+
+If your Spark is **without a Hadoop environment**, you can use the following code snippet to access the fileset:
+
+```python
+## Replace the PYSPARK_SUBMIT_ARGS setting in the snippet above with the
+## following; everything else stays the same.
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+```
+
+- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with the Hadoop environment (3.3.1) and the `hadoop-azure` jar.
+- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without the Hadoop environment and the `hadoop-azure` jar.
+- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
+
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, a Hadoop environment is necessary for the driver, and adding the bundle jars with `--jars` may not work. If this is the case, you should add the jars to the Spark CLASSPATH directly.
+:::
+
+### Accessing a fileset using the Hadoop fs command
+
+The following are examples of how to use the `hadoop fs` command to access the fileset in Hadoop 3.1.3.
+
+1. Add the following contents to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file:
+
+```xml
+ <property>
+ <name>fs.AbstractFileSystem.gvfs.impl</name>
+ <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
+ </property>
+
+  <property>
+    <name>fs.gvfs.impl</name>
+    <value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
+  </property>
+
+ <property>
+ <name>fs.gravitino.server.uri</name>
+ <value>http://localhost:8090</value>
+ </property>
+
+ <property>
+ <name>fs.gravitino.client.metalake</name>
+ <value>test</value>
+ </property>
+
+ <property>
+ <name>azure-storage-account-name</name>
+ <value>account_name</value>
+ </property>
+ <property>
+ <name>azure-storage-account-key</name>
+ <value>account_key</value>
+ </property>
+```
+
+2. Add the necessary jars to the Hadoop classpath.
+
+For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` located at `${HADOOP_HOME}/share/hadoop/tools/lib/` to the Hadoop classpath.
+
+3. Run the following command to access the fileset:
+
+```shell
+${HADOOP_HOME}/bin/hadoop fs -ls gvfs://fileset/adls_catalog/adls_schema/adls_fileset
+${HADOOP_HOME}/bin/hadoop fs -put /path/to/local/file gvfs://fileset/adls_catalog/adls_schema/adls_fileset
+```
+
+### Using the GVFS Python client to access a fileset
+
+To access a fileset with Azure Blob Storage (ADLS) using the GVFS Python client, in addition to the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations:
+
+| Configuration item           | Description                            | Default value | Required | Since version    |
+|------------------------------|----------------------------------------|---------------|----------|------------------|
+| `azure_storage_account_name` | The account name of Azure Blob Storage | (none)        | Yes      | 0.8.0-incubating |
+| `azure_storage_account_key`  | The account key of Azure Blob Storage  | (none)        | Yes      | 0.8.0-incubating |
+
+:::note
+If the catalog has enabled [credential vending](security/credential-vending.md), the properties above can be omitted.
+:::
+
+Please install the `gravitino` package before running the following code:
+
+```bash
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+
+```python
+from gravitino import gvfs
+options = {
+ "cache_size": 20,
+ "cache_expired_time": 3600,
+ "auth_type": "simple",
+ "azure_storage_account_name": "azure_account_name",
+ "azure_storage_account_key": "azure_account_key"
+}
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options)
+fs.ls("gvfs://fileset/{adls_catalog}/{adls_schema}/{adls_fileset}/")
+```
+
+
+### Using fileset with pandas
+
+The following are examples of how to use the pandas library to access the ADLS fileset:
+
+```python
+import pandas as pd
+
+storage_options = {
+ "server_uri": "http://localhost:8090",
+ "metalake_name": "test",
+ "options": {
+ "azure_storage_account_name": "azure_account_name",
+ "azure_storage_account_key": "azure_account_key"
+ }
+}
+ds = pd.read_csv(f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv",
+                 storage_options=storage_options)
+ds.head()
+```
+
+For other use cases, please refer to the [Gravitino Virtual File System](./how-to-use-gvfs.md) document.
+
+## Fileset with credential vending
+
+Since 0.8.0-incubating, Gravitino supports credential vending for ADLS filesets. If the catalog has been [configured with credentials](./security/credential-vending.md), you can access ADLS filesets without providing authentication information like `azure-storage-account-name` and `azure-storage-account-key` in the properties.
+
+### How to create an ADLS Hadoop catalog with credential enabled
+
+Apart from the configuration method in [create-adls-hadoop-catalog](#configuration-for-a-adls-hadoop-catalog), the properties required by [adls-credential](./security/credential-vending.md#adls-credentials) should also be set to enable credential vending for ADLS filesets.
+
+### How to access ADLS fileset with credential
+
+If the catalog has been configured with credentials, you can access ADLS filesets without providing authentication information via the GVFS Java/Python client and Spark. Let's see how to access an ADLS fileset with credentials:
+
+GVFS Java client:
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+// No need to set azure-storage-account-name and azure-storage-account-key
+Path filesetPath = new Path("gvfs://fileset/adls_test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+Spark:
+
+```python
+spark = (
+    SparkSession.builder
+    .appName("adls_fileset_test")
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test")
+    # No need to set azure-storage-account-name and azure-storage-account-key
+    .config("spark.driver.memory", "2g")
+    .config("spark.driver.port", "2048")
+    .getOrCreate()
+)
+```
+
+Python client and Hadoop command are similar to the above examples.
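For the GVFS Python client, a minimal sketch (assuming the same `gvfs` API shown earlier) is that the Azure account options are simply omitted when credential vending is enabled:

```python
# Sketch: GVFS Python client options when the catalog vends credentials.
# The azure_storage_account_name / azure_storage_account_key entries from
# the earlier example are intentionally omitted.
options = {
    "cache_size": 20,
    "cache_expired_time": 3600,
    "auth_type": "simple",
}

# With a running Gravitino server, the filesystem would be created as before:
# from gravitino import gvfs
# fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090",
#                                      metalake_name="test_metalake", options=options)
print(sorted(options))
```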
+
diff --git a/docs/hadoop-catalog-with-gcs.md b/docs/hadoop-catalog-with-gcs.md
new file mode 100644
index 0000000000..a3eb034b4f
--- /dev/null
+++ b/docs/hadoop-catalog-with-gcs.md
@@ -0,0 +1,500 @@
+---
+title: "Hadoop catalog with GCS"
+slug: /hadoop-catalog-with-gcs
+date: 2025-01-03
+keyword: Hadoop catalog GCS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with GCS.
+
+## Prerequisites
+
+To set up a Hadoop catalog with GCS, follow these steps:
+
+1. Download the
[`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle)
file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at
`${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
+
+```bash
+$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start
+```
+
+Once the server is up and running, you can proceed to configure the Hadoop catalog with GCS. The rest of this document uses `http://localhost:8090` as the Gravitino server URL; please replace it with your actual server URL.
+
+## Configurations for creating a Hadoop catalog with GCS
+
+### Configurations for a GCS Hadoop catalog
+
+Apart from the configurations mentioned in [Hadoop catalog configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS:
+
+| Configuration item | Description
[...]
+|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| `filesystem-providers`        | The file system providers to add. Set it to `gcs` if it's a GCS fileset, or a comma-separated string that contains `gcs`, like `gcs,s3`, to support multiple kinds of filesets including `gcs`.
[...]
+| `default-filesystem-provider` | The default filesystem provider of this Hadoop catalog, used when users do not specify a scheme in the URI. The default value is `builtin-local`; for GCS, setting this to `gcs` lets you omit the `gs://` prefix in locations.
[...]
+| `gcs-service-account-file`    | The path of the GCS service account JSON file.
[...]
+| `credential-providers`        | The credential provider types, separated by commas; the possible value is `gcs-token`. The default authentication type uses the service account described above; setting this property enables credential vending from the Gravitino server, so clients no longer need to provide authentication information like the service account to access GCS via GVFS. Once it's set, more configuration items are needed to make it work, please see [gcs-credential-vending](se [...]
+
+
+### Configurations for a schema
+
+Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for
more details.
+
+### Configurations for a fileset
+
+Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for
more details.
+
+## Example of creating a Hadoop catalog with GCS
+
+This section will show you how to use the Hadoop catalog with GCS in Gravitino, including detailed examples.
+
+### Step 1: Create a Hadoop catalog with GCS
+
+First, you need to create a Hadoop catalog with GCS. The following example
shows how to create a Hadoop catalog with GCS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_catalog",
+ "type": "FILESET",
+ "comment": "This is a GCS fileset catalog",
+ "provider": "hadoop",
+ "properties": {
+ "location": "gs://bucket/root",
+ "gcs-service-account-file": "path_of_gcs_service_account_file",
+ "filesystem-providers": "gcs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> gcsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root")
+ .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+ .put("filesystem-providers", "gcs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("test_catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a GCS fileset catalog",
+ gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+gcs_properties = {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"
+}
+
+gcs_catalog = gravitino_client.create_catalog(name="test_catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is a GCS fileset catalog",
+                                              properties=gcs_properties)
+```
+
+</TabItem>
+</Tabs>
+
+### Step 2: Create a schema
+
+Once you have created a Hadoop catalog with GCS, you can create a schema. The
following example shows how to create a schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_schema",
+ "comment": "This is a GCS schema",
+ "properties": {
+ "location": "gs://bucket/root/schema"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root/schema")
+ .build();
+Schema schema = supportsSchemas.createSchema("test_schema",
+ "This is a GCS schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_schemas().create_schema(name="test_schema",
+                                   comment="This is a GCS schema",
+                                   properties={"location": "gs://bucket/root/schema"})
+```
+
+</TabItem>
+</Tabs>
+
+
+### Step3: Create a fileset
+
+After creating a schema, you can create a fileset. The following example shows
how to create a fileset:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": "gs://bucket/root/schema/example_fileset",
+ "properties": {
+ "k1": "v1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+ .put("k1", "v1")
+ .build();
+
+filesetCatalog.createFileset(
+ NameIdentifier.of("test_schema", "example_fileset"),
+ "This is an example fileset",
+ Fileset.Type.MANAGED,
+ "gs://bucket/root/schema/example_fileset",
+ propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema",
"example_fileset"),
+ type=Fileset.Type.MANAGED,
+ comment="This is an example
fileset",
+
storage_location="gs://bucket/root/schema/example_fileset",
+ properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
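A fileset created this way is addressed through a virtual path of the form `gvfs://fileset/{catalog}/{schema}/{fileset}/{sub_path}`. The small sketch below builds such paths; the `gvfs_path` helper is ours for illustration, not part of the Gravitino API:

```python
def gvfs_path(catalog: str, schema: str, fileset: str, sub_path: str = "") -> str:
    """Build a Gravitino virtual file system (GVFS) path for a fileset."""
    base = f"gvfs://fileset/{catalog}/{schema}/{fileset}"
    return f"{base}/{sub_path.lstrip('/')}" if sub_path else base

print(gvfs_path("test_catalog", "test_schema", "example_fileset", "people"))
# gvfs://fileset/test_catalog/test_schema/example_fileset/people
```

The same virtual path works unchanged whatever storage backs the fileset, since the actual storage location is resolved by the Gravitino server.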
+## Accessing a fileset with GCS
+
+### Using the GVFS Java client to access the fileset
+
+To access a fileset backed by GCS using the GVFS Java client, add the following configurations on top of the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1):
+
+| Configuration item         | Description                                | Default value | Required | Since version    |
+|----------------------------|--------------------------------------------|---------------|----------|------------------|
+| `gcs-service-account-file` | The path of GCS service account JSON file. | (none)        | Yes      | 0.7.0-incubating |
+
+:::note
+If the catalog has enabled [credential
vending](security/credential-vending.md), the properties above can be omitted.
More details can be found in [Fileset with credential
vending](#fileset-with-credential-vending).
+:::
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+conf.set("gcs-service-account-file", "/path/your-service-account-file.json");
+Path filesetPath = new
Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+As with the Spark configurations below, you need to add the GCS (bundle) jars to the classpath according to your environment.
+If you want to use a custom Hadoop version, or your project already includes a Hadoop version, you can add the following dependencies to your `pom.xml`:
+
+```xml
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-common</artifactId>
+ <version>${HADOOP_VERSION}</version>
+ </dependency>
+ <dependency>
+ <groupId>com.google.cloud.bigdataoss</groupId>
+ <artifactId>gcs-connector</artifactId>
+ <version>${GCS_CONNECTOR_VERSION}</version>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-gcp</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+Or, if your project has no Hadoop environment, use the bundle jar, which includes the Hadoop environment:
+
+```xml
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-gcp-bundle</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+### Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+Before running the following code, you need to install required packages:
+
+```bash
+pip install pyspark==3.1.3
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+Then you can run the following code:
+
+```python
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "your_gcs_catalog"
+schema_name = "your_gcs_schema"
+fileset_name = "your_gcs_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars
/path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar
--master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("gcs_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", gravitino_url) \
+    .config("spark.hadoop.fs.gravitino.client.metalake", metalake_name) \
+    .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path =
f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
+
+If your Spark is running **without a Hadoop environment**, you can use the following code snippet to access the fileset:
+
+```python
+# Use the snippet above, replacing only the PYSPARK_SUBMIT_ARGS line with the following
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars
/path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,
--master local[1] pyspark-shell"
+```
+
+- [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) is the Gravitino GCP jar with the Hadoop environment (3.3.1) and `gcs-connector`.
+- [`gravitino-gcp-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp) is a condensed version of the Gravitino GCP bundle jar without the Hadoop environment and [`gcs-connector`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar).
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, the driver needs a Hadoop environment, so adding the bundle jars with `--jars` may not work. If this is the case, add the jars to the Spark CLASSPATH directly.
+:::
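If you script this setup, the `--jars` value can be assembled from whichever jar list matches your environment. A minimal sketch; the `pyspark_submit_args` helper and the jar paths are illustrative, not part of any Gravitino API:

```python
def pyspark_submit_args(jar_paths, master="local[1]"):
    """Assemble a PYSPARK_SUBMIT_ARGS value like the ones used above."""
    return f"--jars {','.join(jar_paths)} --master {master} pyspark-shell"

# With a Hadoop environment: gcp jar + runtime jar + gcs-connector
print(pyspark_submit_args([
    "/path/to/gravitino-gcp-{gravitino-version}.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar",
    "/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar",
]))
```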
+
+### Accessing a fileset using the Hadoop fs command
+
+The following are examples of how to use the `hadoop fs` command to access the
fileset in Hadoop 3.1.3.
+
+1. Add the following content to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file:
+
+```xml
+ <property>
+ <name>fs.AbstractFileSystem.gvfs.impl</name>
+ <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
+ </property>
+
+ <property>
+ <name>fs.gvfs.impl</name>
+
<value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
+ </property>
+
+ <property>
+ <name>fs.gravitino.server.uri</name>
+ <value>http://localhost:8090</value>
+ </property>
+
+ <property>
+ <name>fs.gravitino.client.metalake</name>
+ <value>test</value>
+ </property>
+
+ <property>
+ <name>gcs-service-account-file</name>
+ <value>/path/your-service-account-file.json</value>
+ </property>
+```
+
+2. Add the necessary jars to the Hadoop classpath.
+
+For GCS, you need to add
`gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`,
`gravitino-gcp-${gravitino-version}.jar` and
[`gcs-connector-hadoop3-2.2.22-shaded.jar`](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases/download/v2.2.22/gcs-connector-hadoop3-2.2.22-shaded.jar)
to Hadoop classpath.
+
+3. Run the following command to access the fileset:
+
+```shell
+${HADOOP_HOME}/bin/hadoop fs -ls gvfs://fileset/gcs_catalog/gcs_schema/gcs_example
+${HADOOP_HOME}/bin/hadoop fs -put /path/to/local/file gvfs://fileset/gcs_catalog/gcs_schema/gcs_example
+```
+
+### Using the GVFS Python client to access a fileset
+
+To access a fileset backed by GCS using the GVFS Python client, add the following configurations on top of the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1):
+
+| Configuration item         | Description                                | Default value | Required | Since version    |
+|----------------------------|--------------------------------------------|---------------|----------|------------------|
+| `gcs_service_account_file` | The path of GCS service account JSON file. | (none)        | Yes      | 0.7.0-incubating |
+
+:::note
+If the catalog has enabled [credential
vending](security/credential-vending.md), the properties above can be omitted.
+:::
+
+Please install the `gravitino` package before running the following code:
+
+```bash
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+
+```python
+from gravitino import gvfs
+options = {
+ "cache_size": 20,
+ "cache_expired_time": 3600,
+ "auth_type": "simple",
+ "gcs_service_account_file": "path_of_gcs_service_account_file.json",
+}
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090",
metalake_name="test_metalake", options=options)
+fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/")
+```
+
+### Using fileset with pandas
+
+The following are examples of how to use the pandas library to access the GCS fileset:
+
+```python
+import pandas as pd
+
+storage_options = {
+ "server_uri": "http://localhost:8090",
+ "metalake_name": "test",
+ "options": {
+ "gcs_service_account_file": "path_of_gcs_service_account_file.json",
+ }
+}
+ds = pd.read_csv(f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv",
+                 storage_options=storage_options)
+ds.head()
+```
+
+For other use cases, please refer to the [Gravitino Virtual File
System](./how-to-use-gvfs.md) document.
+
+## Fileset with credential vending
+
+Since 0.8.0-incubating, Gravitino supports credential vending for GCS fileset.
If the catalog has been [configured with
credential](./security/credential-vending.md), you can access GCS fileset
without providing authentication information like `gcs-service-account-file` in
the properties.
+
+### How to create a GCS Hadoop catalog with credential enabled
+
+In addition to the configuration method in [create-gcs-hadoop-catalog](#configurations-for-a-gcs-hadoop-catalog), the properties required by [gcs-credential](./security/credential-vending.md#gcs-credentials) should also be set to enable credential vending for GCS filesets.
+
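For example, the catalog-creation request shown earlier would additionally carry a `credential-providers` property. A sketch of the properties payload; we assume the `gcs-token` provider name from the credential-vending document:

```json
{
  "location": "gs://bucket/root",
  "gcs-service-account-file": "path_of_gcs_service_account_file",
  "filesystem-providers": "gcs",
  "credential-providers": "gcs-token"
}
```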
+### How to access GCS fileset with credential
+
+If the catalog has been configured with credential, you can access GCS fileset
without providing authentication information via GVFS Java/Python client and
Spark. Let's see how to access GCS fileset with credential:
+
+GVFS Java client:
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+// No need to set gcs-service-account-file
+Path filesetPath = new
Path("gvfs://fileset/gcs_test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+Spark:
+
+```python
+# No need to set gcs-service-account-file
+spark = SparkSession.builder \
+    .appName("gcs_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+```
+
+The Python client and the Hadoop command are similar to the above examples.
diff --git a/docs/hadoop-catalog-with-oss.md b/docs/hadoop-catalog-with-oss.md
new file mode 100644
index 0000000000..e63935c720
--- /dev/null
+++ b/docs/hadoop-catalog-with-oss.md
@@ -0,0 +1,538 @@
+---
+title: "Hadoop catalog with OSS"
+slug: /hadoop-catalog-with-oss
+date: 2025-01-03
+keyword: Hadoop catalog OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document explains how to configure a Hadoop catalog with Aliyun OSS
(Object Storage Service) in Gravitino.
+
+## Prerequisites
+
+To set up a Hadoop catalog with OSS, follow these steps:
+
+1. Download the
[`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle)
file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at
`${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
+
+```bash
+$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start
+```
+
+Once the server is up and running, you can proceed to configure the Hadoop catalog with OSS. The rest of this document uses `http://localhost:8090` as the Gravitino server URL; replace it with your actual server URL.
+
+## Configurations for creating a Hadoop catalog with OSS
+
+### Configuration for an OSS Hadoop catalog
+
+In addition to the basic configurations mentioned in
[Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties),
the following properties are required to configure a Hadoop catalog with OSS:
+
+| Configuration item | Description
[...]
+|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| `filesystem-providers` | The file system providers to add. Set it to
`oss` if it's a OSS fileset, or a comma separated string that contains `oss`
like `oss,gs,s3` to support multiple kinds of fileset including `oss`.
[...]
+| `default-filesystem-provider` | The name default filesystem providers of
this Hadoop catalog if users do not specify the scheme in the URI. Default
value is `builtin-local`, for OSS, if we set this value, we can omit the prefix
'oss://' in the location.
[...]
+| `oss-endpoint` | The endpoint of the Aliyun OSS.
[...]
+| `oss-access-key-id` | The access key of the Aliyun OSS.
[...]
+| `oss-secret-access-key` | The secret key of the Aliyun OSS.
[...]
+| `credential-providers` | The credential provider types, separated by
comma, possible value can be `oss-token`, `oss-secret-key`. As the default
authentication type is using AKSK as the above, this configuration can enable
credential vending provided by Gravitino server and client will no longer need
to provide authentication information like AKSK to access OSS by GVFS. Once
it's set, more configuration items are needed to make it works, please see
[oss-credential-vending](secur [...]
+
+
+### Configurations for a schema
+
+To create a schema, refer to [Schema
configurations](./hadoop-catalog.md#schema-properties).
+
+### Configurations for a fileset
+
+For instructions on how to create a fileset, refer to [Fileset
configurations](./hadoop-catalog.md#fileset-properties) for more details.
+
+## Example of creating Hadoop catalog/schema/fileset with OSS
+
+This section will show you how to use the Hadoop catalog with OSS in
Gravitino, including detailed examples.
+
+### Step1: Create a Hadoop catalog with OSS
+
+First, you need to create a Hadoop catalog for OSS. The following examples
demonstrate how to create a Hadoop catalog with OSS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_catalog",
+ "type": "FILESET",
+ "comment": "This is an OSS fileset catalog",
+ "provider": "hadoop",
+ "properties": {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key",
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+ "filesystem-providers": "oss"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> ossProperties = ImmutableMap.<String, String>builder()
+ .put("location", "oss://bucket/root")
+ .put("oss-access-key-id", "access_key")
+ .put("oss-secret-access-key", "secret_key")
+ .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
+ .put("filesystem-providers", "oss")
+ .build();
+
+Catalog ossCatalog = gravitinoClient.createCatalog("test_catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is an OSS fileset catalog",
+ ossProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+oss_properties = {
+    "location": "oss://bucket/root",
+    "oss-access-key-id": "access_key",
+    "oss-secret-access-key": "secret_key",
+    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+    "filesystem-providers": "oss"
+}
+
+oss_catalog = gravitino_client.create_catalog(name="test_catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is an OSS fileset catalog",
+                                              properties=oss_properties)
+```
+
+</TabItem>
+</Tabs>
+
+### Step2: Create a schema
+
+Once the Hadoop catalog with OSS is created, you can create a schema inside
that catalog. Below are examples of how to do this:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_schema",
+ "comment": "This is an OSS schema",
+ "properties": {
+ "location": "oss://bucket/root/schema"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+ .put("location", "oss://bucket/root/schema")
+ .build();
+Schema schema = supportsSchemas.createSchema("test_schema",
+ "This is an OSS schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_schemas().create_schema(name="test_schema",
+ comment="This is an OSS schema",
+ properties={"location":
"oss://bucket/root/schema"})
+```
+
+</TabItem>
+</Tabs>
+
+
+### Step3: Create a fileset
+
+Now that the schema is created, you can create a fileset inside it. Here’s how:
+
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": "oss://bucket/root/schema/example_fileset",
+ "properties": {
+ "k1": "v1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+ .put("k1", "v1")
+ .build();
+
+filesetCatalog.createFileset(
+ NameIdentifier.of("test_schema", "example_fileset"),
+ "This is an example fileset",
+ Fileset.Type.MANAGED,
+ "oss://bucket/root/schema/example_fileset",
+ propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema",
"example_fileset"),
+ type=Fileset.Type.MANAGED,
+ comment="This is an example
fileset",
+
storage_location="oss://bucket/root/schema/example_fileset",
+ properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+## Accessing a fileset with OSS
+
+### Using the GVFS Java client to access the fileset
+
+To access a fileset backed by OSS using the GVFS Java client, add the following configurations on top of the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1):
+
+| Configuration item      | Description                       | Default value | Required | Since version    |
+|-------------------------|-----------------------------------|---------------|----------|------------------|
+| `oss-endpoint`          | The endpoint of the Aliyun OSS.   | (none)        | Yes      | 0.7.0-incubating |
+| `oss-access-key-id`     | The access key of the Aliyun OSS. | (none)        | Yes      | 0.7.0-incubating |
+| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none)        | Yes      | 0.7.0-incubating |
+
+:::note
+If the catalog has enabled [credential
vending](security/credential-vending.md), the properties above can be omitted.
More details can be found in [Fileset with credential
vending](#fileset-with-credential-vending).
+:::
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+conf.set("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com");
+conf.set("oss-access-key-id", "access_key");
+conf.set("oss-secret-access-key", "secret_key");
+Path filesetPath = new
Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+As with the Spark configurations below, you need to add the OSS (bundle) jars to the classpath according to your environment.
+If you want to use a custom Hadoop version, or your project already includes a Hadoop version, you can add the following dependencies to your `pom.xml`:
+
+```xml
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-common</artifactId>
+ <version>${HADOOP_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-aliyun</artifactId>
+ <version>${HADOOP_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-aliyun</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+Or, if your project has no Hadoop environment, use the bundle jar, which includes the Hadoop environment:
+
+```xml
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-aliyun-bundle</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+### Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+Before running the following code, you need to install required packages:
+
+```bash
+pip install pyspark==3.1.3
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+Then you can run the following code:
+
+```python
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "your_oss_catalog"
+schema_name = "your_oss_schema"
+fileset_name = "your_oss_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars
/path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar
--master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("oss_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", gravitino_url) \
+    .config("spark.hadoop.fs.gravitino.client.metalake", metalake_name) \
+    .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) \
+    .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) \
+    .config("spark.hadoop.oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path =
f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
+
+If your Spark is running **without a Hadoop environment**, you can use the following code snippet to access the fileset:
+
+```python
+# Use the snippet above, replacing only the PYSPARK_SUBMIT_ARGS line with the following
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars
/path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,
--master local[1] pyspark-shell"
+```
+
+- [`gravitino-aliyun-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun-bundle) is the Gravitino Aliyun jar with the Hadoop environment (3.3.1) and the `hadoop-aliyun` jar.
+- [`gravitino-aliyun-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aliyun) is a condensed version of the Gravitino Aliyun bundle jar without the Hadoop environment and the `hadoop-aliyun` jar.
+- `hadoop-aliyun-3.2.0.jar` and `aliyun-sdk-oss-2.8.3.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, the driver needs a Hadoop environment, so adding the bundle jars with `--jars` may not work. If this is the case, add the jars to the Spark CLASSPATH directly.
+:::
+
+### Accessing a fileset using the Hadoop fs command
+
+The following are examples of how to use the `hadoop fs` command to access the
fileset in Hadoop 3.1.3.
+
+1. Add the following content to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file:
+
+```xml
+ <property>
+ <name>fs.AbstractFileSystem.gvfs.impl</name>
+ <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
+ </property>
+
+ <property>
+ <name>fs.gvfs.impl</name>
+
<value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
+ </property>
+
+ <property>
+ <name>fs.gravitino.server.uri</name>
+ <value>http://localhost:8090</value>
+ </property>
+
+ <property>
+ <name>fs.gravitino.client.metalake</name>
+ <value>test</value>
+ </property>
+
+ <property>
+ <name>oss-endpoint</name>
+ <value>http://oss-cn-hangzhou.aliyuncs.com</value>
+ </property>
+
+ <property>
+ <name>oss-access-key-id</name>
+ <value>access-key</value>
+ </property>
+
+ <property>
+ <name>oss-secret-access-key</name>
+ <value>secret-key</value>
+ </property>
+```
+
+2. Add the necessary jars to the Hadoop classpath.
+
+For OSS, you need to add
`gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`,
`gravitino-aliyun-${gravitino-version}.jar` and
`hadoop-aliyun-${hadoop-version}.jar` located at
`${HADOOP_HOME}/share/hadoop/tools/lib/` to Hadoop classpath.
+
+3. Run the following command to access the fileset:
+
+```shell
+${HADOOP_HOME}/bin/hadoop fs -ls gvfs://fileset/oss_catalog/oss_schema/oss_fileset
+${HADOOP_HOME}/bin/hadoop fs -put /path/to/local/file gvfs://fileset/oss_catalog/oss_schema/oss_fileset
+```
+
+### Using the GVFS Python client to access a fileset
+
+To access a fileset backed by OSS using the GVFS Python client, add the following configurations on top of the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1):
+
+| Configuration item      | Description                       | Default value | Required | Since version    |
+|-------------------------|-----------------------------------|---------------|----------|------------------|
+| `oss_endpoint`          | The endpoint of the Aliyun OSS.   | (none)        | Yes      | 0.7.0-incubating |
+| `oss_access_key_id`     | The access key of the Aliyun OSS. | (none)        | Yes      | 0.7.0-incubating |
+| `oss_secret_access_key` | The secret key of the Aliyun OSS. | (none)        | Yes      | 0.7.0-incubating |
+
+:::note
+If the catalog has enabled [credential
vending](security/credential-vending.md), the properties above can be omitted.
+:::
+
+Please install the `gravitino` package before running the following code:
+
+```bash
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+
+```python
+from gravitino import gvfs
+options = {
+ "cache_size": 20,
+ "cache_expired_time": 3600,
+ "auth_type": "simple",
+ "oss_endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+ "oss_access_key_id": "access_key",
+ "oss_secret_access_key": "secret_key"
+}
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090",
metalake_name="test_metalake", options=options)
+
+fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/")
+```
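Because a missing option usually only surfaces on first file access, you can fail fast by checking the option keys up front. A small sketch; `check_oss_options` is our helper, not part of the `gravitino` package:

```python
REQUIRED_OSS_OPTIONS = ("oss_endpoint", "oss_access_key_id", "oss_secret_access_key")

def check_oss_options(options):
    """Return the required OSS option keys missing from `options`."""
    return [key for key in REQUIRED_OSS_OPTIONS if not options.get(key)]

print(check_oss_options({"oss_endpoint": "http://oss-cn-hangzhou.aliyuncs.com"}))
# ['oss_access_key_id', 'oss_secret_access_key']
```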
+
+
+### Using fileset with pandas
+
+The following is an example of how to use the pandas library to access the OSS fileset:
+
+```python
+import pandas as pd
+
+storage_options = {
+ "server_uri": "http://localhost:8090",
+ "metalake_name": "test",
+ "options": {
+ "oss_access_key_id": "access_key",
+ "oss_secret_access_key": "secret_key",
+ "oss_endpoint": "http://oss-cn-hangzhou.aliyuncs.com"
+ }
+}
+ds = pd.read_csv(f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv",
+                 storage_options=storage_options)
+ds.head()
+```
+For other use cases, please refer to the [Gravitino Virtual File
System](./how-to-use-gvfs.md) document.
+
+## Fileset with credential vending
+
+Since 0.8.0-incubating, Gravitino supports credential vending for OSS filesets. If the catalog has been [configured with credentials](./security/credential-vending.md), you can access an OSS fileset without providing authentication information like `oss-access-key-id` and `oss-secret-access-key` in the properties.
+
+### How to create an OSS Hadoop catalog with credential enabled
+
+In addition to the configuration method in [create-oss-hadoop-catalog](#configuration-for-an-oss-hadoop-catalog), the properties required by [oss-credential](./security/credential-vending.md#oss-credentials) must also be set to enable credential vending for the OSS fileset.
+
+### How to access OSS fileset with credential
+
+If the catalog has been configured with credentials, you can access an OSS fileset without providing authentication information via the GVFS Java/Python client and Spark. The following examples show how:
+
+GVFS Java client:
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+// No need to set oss-access-key-id and oss-secret-access-key
+Path filesetPath = new
Path("gvfs://fileset/oss_test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+Spark:
+
+```python
+# No need to set oss-access-key-id and oss-secret-access-key
+spark = SparkSession.builder \
+    .appName("oss_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+```
+
+The GVFS Python client and the Hadoop command line are used similarly to the examples above; simply omit the OSS credential configurations.
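
For the GVFS Python client, a minimal sketch of the difference (option names as used earlier in this document; the endpoint and key values are placeholders):

```python
# With credential vending enabled on the catalog, the OSS credential
# options are simply omitted from the GVFS client options.
options_with_vending = {
    "cache_size": 20,
    "cache_expired_time": 3600,
    "auth_type": "simple",
}

# Without credential vending, the OSS credentials must be supplied explicitly.
options_without_vending = {
    **options_with_vending,
    "oss_endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
    "oss_access_key_id": "access_key",
    "oss_secret_access_key": "secret_key",
}
```

In both cases the file system is constructed the same way, e.g. `gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options_with_vending)`.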
+
+
diff --git a/docs/hadoop-catalog-with-s3.md b/docs/hadoop-catalog-with-s3.md
new file mode 100644
index 0000000000..7d56f2b9ab
--- /dev/null
+++ b/docs/hadoop-catalog-with-s3.md
@@ -0,0 +1,541 @@
+---
+title: "Hadoop catalog with S3"
+slug: /hadoop-catalog-with-s3
+date: 2025-01-03
+keyword: Hadoop catalog S3
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document explains how to configure a Hadoop catalog with S3 in Gravitino.
+
+## Prerequisites
+
+To create a Hadoop catalog with S3, follow these steps:
+
+1. Download the
[`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle)
file.
+2. Place this file in the Gravitino Hadoop catalog classpath at
`${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server using the following command:
+
+```bash
+$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start
+```
+
+Once the server is up and running, you can proceed to configure the Hadoop catalog with S3. The rest of this document uses `http://localhost:8090` as the Gravitino server URL; replace it with your actual server URL.
+
+## Configurations for creating a Hadoop catalog with S3
+
+### Configurations for S3 Hadoop Catalog
+
+In addition to the basic configurations mentioned in [Hadoop catalog configurations](./hadoop-catalog.md#catalog-properties), the following properties are necessary to configure a Hadoop catalog with S3:
+
+| Configuration item | Description
[...]
+|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| `filesystem-providers` | The file system providers to add. Set it to
`s3` if it's a S3 fileset, or a comma separated string that contains `s3` like
`gs,s3` to support multiple kinds of fileset including `s3`.
[...]
+| `default-filesystem-provider` | The name default filesystem providers of
this Hadoop catalog if users do not specify the scheme in the URI. Default
value is `builtin-local`, for S3, if we set this value, we can omit the prefix
's3a://' in the location.
[...]
+| `s3-endpoint` | The endpoint of the AWS S3. This
configuration is optional for S3 service, but required for other S3-compatible
storage services like MinIO.
[...]
+| `s3-access-key-id` | The access key of the AWS S3.
[...]
+| `s3-secret-access-key` | The secret key of the AWS S3.
[...]
+| `credential-providers` | The credential provider types, separated by
comma, possible value can be `s3-token`, `s3-secret-key`. As the default
authentication type is using AKSK as the above, this configuration can enable
credential vending provided by Gravitino server and client will no longer need
to provide authentication information like AKSK to access S3 by GVFS. Once it's
set, more configuration items are needed to make it works, please see
[s3-credential-vending](security/ [...]
+
+### Configurations for a schema
+
+To learn how to create a schema, refer to [Schema
configurations](./hadoop-catalog.md#schema-properties).
+
+### Configurations for a fileset
+
+For more details on creating a fileset, refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties).
+
+
+## Using the Hadoop catalog with S3
+
+This section demonstrates how to use the Hadoop catalog with S3 in Gravitino,
with a complete example.
+
+### Step 1: Create a Hadoop catalog with S3
+
+First of all, you need to create a Hadoop catalog with S3. The following
example shows how to create a Hadoop catalog with S3:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_catalog",
+ "type": "FILESET",
+ "comment": "This is a S3 fileset catalog",
+ "provider": "hadoop",
+ "properties": {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key",
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> s3Properties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root")
+ .put("s3-access-key-id", "access_key")
+ .put("s3-secret-access-key", "secret_key")
+ .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+ .put("filesystem-providers", "s3")
+ .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("test_catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a S3 fileset catalog",
+ s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key"
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="test_catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a S3 fileset
catalog",
+ properties=s3_properties)
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+When using S3 with Hadoop, ensure that the location value starts with `s3a://` (not `s3://`) for AWS S3. For example, use `s3a://bucket/root`, as the `s3://` scheme is not supported by the `hadoop-aws` library.
+:::
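
If locations may come in with the unsupported `s3://` scheme, a small helper (illustrative only, not part of Gravitino) can normalize them before creating the catalog or fileset:

```python
def normalize_s3_location(location: str) -> str:
    """Rewrite s3:// URIs to the s3a:// scheme expected by hadoop-aws."""
    if location.startswith("s3://"):
        return "s3a://" + location[len("s3://"):]
    return location

print(normalize_s3_location("s3://bucket/root"))  # s3a://bucket/root
```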
+
+### Step 2: Create a schema
+
+Once your Hadoop catalog with S3 is created, you can create a schema under the
catalog. Here are examples of how to do that:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_schema",
+ "comment": "This is a S3 schema",
+ "properties": {
+ "location": "s3a://bucket/root/schema"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+Catalog catalog = gravitinoClient.loadCatalog("hive_catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root/schema")
+ .build();
+Schema schema = supportsSchemas.createSchema("test_schema",
+ "This is a S3 schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_schemas().create_schema(name="test_schema",
+ comment="This is a S3 schema",
+ properties={"location":
"s3a://bucket/root/schema"})
+```
+
+</TabItem>
+</Tabs>
+
+### Step 3: Create a fileset
+
+After creating the schema, you can create a fileset. Here are examples for
creating a fileset:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": "s3a://bucket/root/schema/example_fileset",
+ "properties": {
+ "k1": "v1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+ .put("k1", "v1")
+ .build();
+
+filesetCatalog.createFileset(
+ NameIdentifier.of("test_schema", "example_fileset"),
+ "This is an example fileset",
+ Fileset.Type.MANAGED,
+ "s3a://bucket/root/schema/example_fileset",
+  propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"),
+ type=Fileset.Type.MANAGED,
+ comment="This is an example
fileset",
+
storage_location="s3a://bucket/root/schema/example_fileset",
+ properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+## Accessing a fileset with S3
+
+### Using the GVFS Java client to access the fileset
+
+To access a fileset with S3 using the GVFS Java client, in addition to the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations:
+
+| Configuration item     | Description                                                                                                                                 | Default value | Required | Since version    |
+|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|---------------|----------|------------------|
+| `s3-endpoint`          | The endpoint of the AWS S3. This configuration is optional for S3 service, but required for other S3-compatible storage services like MinIO. | (none)        | No       | 0.7.0-incubating |
+| `s3-access-key-id`     | The access key of the AWS S3.                                                                                                                | (none)        | Yes      | 0.7.0-incubating |
+| `s3-secret-access-key` | The secret key of the AWS S3.                                                                                                                | (none)        | Yes      | 0.7.0-incubating |
+
+:::note
+- `s3-endpoint` is an optional configuration for AWS S3, however, it is
required for other S3-compatible storage services like MinIO.
+- If the catalog has enabled [credential
vending](security/credential-vending.md), the properties above can be omitted.
More details can be found in [Fileset with credential
vending](#fileset-with-credential-vending).
+:::
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+conf.set("s3-endpoint", "http://localhost:8090");
+conf.set("s3-access-key-id", "minio");
+conf.set("s3-secret-access-key", "minio123");
+
+Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+As with the Spark setup below, you need to add the S3 (bundle) jars to the classpath according to your environment. For a Maven project, the dependencies are:
+
+```xml
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-common</artifactId>
+ <version>${HADOOP_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-aws</artifactId>
+ <version>${HADOOP_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-aws</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+Or, if there is no Hadoop environment, use the bundle jar, which includes the Hadoop dependencies:
+
+```xml
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>gravitino-aws-bundle</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.gravitino</groupId>
+ <artifactId>filesystem-hadoop3-runtime</artifactId>
+ <version>${GRAVITINO_VERSION}</version>
+ </dependency>
+```
+
+### Using Spark to access the fileset
+
+The following Python code demonstrates how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+Before running the following code, you need to install required packages:
+
+```bash
+pip install pyspark==3.1.3
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+Then you can run the following code:
+
+```python
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "your_s3_catalog"
+schema_name = "your_s3_schema"
+fileset_name = "your_s3_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars
/path/to/gravitino-aws-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar
--master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("s3_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) \
+    .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) \
+    .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path =
f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
+
+If your Spark is running **without a Hadoop environment**, you can use the following code snippet to access the fileset:
+
+```python
+# Replace the PYSPARK_SUBMIT_ARGS setting in the snippet above with the following; the other environment variables stay the same
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars
/path/to/gravitino-aws-bundle-${gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-${gravitino-version}-SNAPSHOT.jar
--master local[1] pyspark-shell"
+```
+
+- [`gravitino-aws-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws-bundle) is the Gravitino AWS jar bundled with the Hadoop environment (3.3.1) and the `hadoop-aws` jar.
+- [`gravitino-aws-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-aws) is a condensed version of the Gravitino AWS bundle jar, without the Hadoop environment and the `hadoop-aws` jar.
+- `hadoop-aws-3.2.0.jar` and `aws-java-sdk-bundle-1.11.375.jar` can be found
in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib`
directory.
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, the driver needs a Hadoop environment, so adding the bundle jars with `--jars` may not work. If this is the case, you should add the jars to the Spark CLASSPATH directly.
+:::
+
+### Accessing a fileset using the Hadoop fs command
+
+The following are examples of how to use the `hadoop fs` command to access the
fileset in Hadoop 3.1.3.
+
+1. Add the following content to the `${HADOOP_HOME}/etc/hadoop/core-site.xml` file:
+
+```xml
+ <property>
+ <name>fs.AbstractFileSystem.gvfs.impl</name>
+ <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
+ </property>
+
+ <property>
+ <name>fs.gvfs.impl</name>
+
<value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
+ </property>
+
+ <property>
+ <name>fs.gravitino.server.uri</name>
+ <value>http://localhost:8090</value>
+ </property>
+
+ <property>
+ <name>fs.gravitino.client.metalake</name>
+ <value>test</value>
+ </property>
+
+ <property>
+ <name>s3-endpoint</name>
+ <value>http://s3.ap-northeast-1.amazonaws.com</value>
+ </property>
+
+ <property>
+ <name>s3-access-key-id</name>
+ <value>access-key</value>
+ </property>
+
+ <property>
+ <name>s3-secret-access-key</name>
+ <value>secret-key</value>
+ </property>
+```
+
+2. Add the necessary jars to the Hadoop classpath.
+
+For S3, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-aws-${gravitino-version}.jar`, and `hadoop-aws-${hadoop-version}.jar` (the last is located at `${HADOOP_HOME}/share/hadoop/tools/lib/`) to the Hadoop classpath.
+
+3. Run the following command to access the fileset:
+
+```shell
+${HADOOP_HOME}/bin/hadoop fs -ls gvfs://fileset/s3_catalog/s3_schema/s3_fileset
+${HADOOP_HOME}/bin/hadoop fs -put /path/to/local/file gvfs://fileset/s3_catalog/s3_schema/s3_fileset
+```
+
+### Using the GVFS Python client to access a fileset
+
+To access a fileset with S3 using the GVFS Python client, in addition to the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), you need to add the following configurations:
+
+| Configuration item     | Description                                                                                                                                 | Default value | Required | Since version    |
+|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|---------------|----------|------------------|
+| `s3_endpoint`          | The endpoint of the AWS S3. This configuration is optional for S3 service, but required for other S3-compatible storage services like MinIO. | (none)        | No       | 0.7.0-incubating |
+| `s3_access_key_id`     | The access key of the AWS S3.                                                                                                                | (none)        | Yes      | 0.7.0-incubating |
+| `s3_secret_access_key` | The secret key of the AWS S3.                                                                                                                | (none)        | Yes      | 0.7.0-incubating |
+
+:::note
+- `s3_endpoint` is an optional configuration for AWS S3, however, it is
required for other S3-compatible storage services like MinIO.
+- If the catalog has enabled [credential
vending](security/credential-vending.md), the properties above can be omitted.
+:::
+
+Please install the `apache-gravitino` package before running the following code:
+
+```bash
+pip install apache-gravitino==${GRAVITINO_VERSION}
+```
+
+```python
+from gravitino import gvfs
+options = {
+ "cache_size": 20,
+ "cache_expired_time": 3600,
+ "auth_type": "simple",
+ "s3_endpoint": "http://localhost:8090",
+ "s3_access_key_id": "minio",
+ "s3_secret_access_key": "minio123"
+}
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090",
metalake_name="test_metalake", options=options)
+fs.ls("gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/")
")
+```
+
+### Using fileset with pandas
+
+The following is an example of how to use the pandas library to access the S3 fileset:
+
+```python
+import pandas as pd
+
+storage_options = {
+ "server_uri": "http://localhost:8090",
+ "metalake_name": "test",
+ "options": {
+ "s3_access_key_id": "access_key",
+ "s3_secret_access_key": "secret_key",
+ "s3_endpoint": "http://s3.ap-northeast-1.amazonaws.com"
+ }
+}
+ds = pd.read_csv(f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people/part-00000-51d366e2-d5eb-448d-9109-32a96c8a14dc-c000.csv",
+                 storage_options=storage_options)
+ds.head()
+```
+
+For more use cases, please refer to the [Gravitino Virtual File
System](./how-to-use-gvfs.md) document.
+
+## Fileset with credential vending
+
+Since 0.8.0-incubating, Gravitino supports credential vending for S3 filesets. If the catalog has been [configured with credentials](./security/credential-vending.md), you can access an S3 fileset without providing authentication information like `s3-access-key-id` and `s3-secret-access-key` in the properties.
+
+### How to create an S3 Hadoop catalog with credential enabled
+
+In addition to the configuration method in [create-s3-hadoop-catalog](#configurations-for-s3-hadoop-catalog), the properties required by [s3-credential](./security/credential-vending.md#s3-credentials) must also be set to enable credential vending for the S3 fileset.
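
As a sketch, the catalog properties from Step 1 gain a `credential-providers` entry. `s3-token` is one of the provider types listed in the catalog configuration table above; the remaining values are placeholders, and additional credential properties from the credential-vending document may also be required:

```python
# Hypothetical property set for an S3 Hadoop catalog with credential vending.
# "s3-token" (or "s3-secret-key") comes from the provider types listed above;
# the other values are placeholders.
s3_properties = {
    "location": "s3a://bucket/root",
    "s3-access-key-id": "access_key",
    "s3-secret-access-key": "secret_key",
    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
    "filesystem-providers": "s3",
    "credential-providers": "s3-token",
}
```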
+
+### How to access S3 fileset with credential
+
+If the catalog has been configured with credentials, you can access an S3 fileset without providing authentication information via the GVFS Java/Python client and Spark. The following examples show how:
+
+GVFS Java client:
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl",
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri", "http://localhost:8090");
+conf.set("fs.gravitino.client.metalake", "test_metalake");
+// No need to set s3-access-key-id and s3-secret-access-key
+Path filesetPath = new
Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+Spark:
+
+```python
+# No need to set s3-access-key-id and s3-secret-access-key
+spark = SparkSession.builder \
+    .appName("s3_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+```
+
+The GVFS Python client and the Hadoop command line are used similarly to the examples above; simply omit the S3 credential configurations.
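
For instance, with pandas the inner `options` dictionary no longer needs the S3 credentials when vending is enabled (server URI and metalake name are placeholders, as in the earlier pandas example):

```python
# pandas storage_options for a catalog with credential vending enabled:
# the s3_* entries from the earlier example are omitted.
storage_options = {
    "server_uri": "http://localhost:8090",
    "metalake_name": "test",
    "options": {},
}
# e.g. pd.read_csv("gvfs://fileset/test_catalog/test_schema/test_fileset/file.csv",
#                  storage_options=storage_options)
```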
+
+
diff --git a/docs/hadoop-catalog.md b/docs/hadoop-catalog.md
index cbdae84689..4b951aedc6 100644
--- a/docs/hadoop-catalog.md
+++ b/docs/hadoop-catalog.md
@@ -9,9 +9,9 @@ license: "This software is licensed under the Apache License
version 2."
## Introduction
Hadoop catalog is a fileset catalog that uses a Hadoop Compatible File System (HCFS) to manage
-the storage location of the fileset. Currently, it supports local filesystem
and HDFS. For
-object storage like S3, GCS, Azure Blob Storage and OSS, you can put the
hadoop object store jar like
-`gravitino-aws-bundle-{gravitino-version}.jar` into the
`$GRAVITINO_HOME/catalogs/hadoop/libs` directory to enable the support.
+the storage location of the fileset. Currently, it supports the local filesystem and HDFS. Since 0.7.0-incubating, Gravitino supports [S3](hadoop-catalog-with-s3.md), [GCS](hadoop-catalog-with-gcs.md), [OSS](hadoop-catalog-with-oss.md) and [Azure Blob Storage](hadoop-catalog-with-adls.md) through the Hadoop catalog.
+
+The rest of this document uses HDFS or the local file system as an example to illustrate how to use the Hadoop catalog. For S3, GCS, OSS and Azure Blob Storage, the configuration is similar to HDFS; please refer to the corresponding document for more details.
Note that Gravitino uses Hadoop 3 dependencies to build Hadoop catalog.
Theoretically, it should be
compatible with both Hadoop 2.x and 3.x, since Gravitino doesn't leverage any
new features in
@@ -23,17 +23,19 @@ Hadoop 3. If there's any compatibility issue, please create
an [issue](https://g
Besides the [common catalog
properties](./gravitino-server-config.md#apache-gravitino-catalog-properties-configuration),
the Hadoop catalog has the following properties:
-| Property Name                  | Description                                                                                         | Default Value | Required | Since Version    |
-|--------------------------------|-----------------------------------------------------------------------------------------------------|---------------|----------|------------------|
-| `location`                     | The storage location managed by Hadoop catalog.                                                     | (none)        | No       | 0.5.0            |
-| `filesystem-conn-timeout-secs` | The timeout of getting the file system using Hadoop FileSystem client instance. Time unit: seconds. | 6             | No       | 0.8.0-incubating |
-| `credential-providers`         | The credential provider types, separated by comma.                                                  | (none)        | No       | 0.8.0-incubating |
+| Property Name                  | Description | Default Value   | Required | Since Version    |
+|--------------------------------|-------------|-----------------|----------|------------------|
+| `location`                     | The storage location managed by Hadoop catalog. | (none) | No | 0.5.0 |
+| `default-filesystem-provider`  | The default filesystem provider of this Hadoop catalog if users do not specify the scheme in the URI. Candidate values are 'builtin-local', 'builtin-hdfs', 's3', 'gcs', 'abs' and 'oss'. Default value is `builtin-local`. For S3, if we set this value to 's3', we can omit the prefix 's3a://' in the location. | `builtin-local` | No | 0.7.0-incubating |
+| `filesystem-providers`         | The file system providers to add. Users need to set this configuration to support cloud storage or custom HCFS. For instance, set it to `s3` or a comma-separated string that contains `s3` like `gs,s3` to support multiple kinds of fileset including `s3`. | (none) | Yes | 0.7.0-incubating |
+| `credential-providers`         | The credential provider types, separated by comma. | (none) | No | 0.8.0-incubating |
+| `filesystem-conn-timeout-secs` | The timeout of getting the file system using Hadoop FileSystem client instance. Time unit: seconds. | 6 | No | 0.8.0-incubating |
Please refer to [Credential vending](./security/credential-vending.md) for
more details about credential vending.
-Apart from the above properties, to access fileset like HDFS, S3, GCS, OSS or
custom fileset, you need to configure the following extra properties.
+### HDFS fileset
-#### HDFS fileset
+Apart from the above properties, to access an HDFS fileset, you need to configure the following extra properties.
| Property Name | Description
|
Default Value | Required |
Since Version |
|----------------------------------------------------|------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------------------|---------------|
@@ -44,66 +46,13 @@ Apart from the above properties, to access fileset like
HDFS, S3, GCS, OSS or cu
| `authentication.kerberos.check-interval-sec` | The check interval of
Kerberos credential for Hadoop catalog. | 60
| No | 0.5.1
|
| `authentication.kerberos.keytab-fetch-timeout-sec` | The fetch timeout of
retrieving Kerberos keytab from `authentication.kerberos.keytab-uri`. | 60
| No | 0.5.1
|
-#### S3 fileset
-
-| Configuration item | Description
| Default value | Required | Since version
|
-|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|------------------|
-| `filesystem-providers` | The file system providers to add. Set it to
`s3` if it's a S3 fileset, or a comma separated string that contains `s3` like
`gs,s3` to support multiple kinds of fileset including `s3`.
| (none) | Yes |
0.7.0-incubating |
-| `default-filesystem-provider` | The name default filesystem providers of
this Hadoop catalog if users do not specify the scheme in the URI. Default
value is `builtin-local`, for S3, if we set this value, we can omit the prefix
's3a://' in the location. | `builtin-local` | No |
0.7.0-incubating |
-| `s3-endpoint` | The endpoint of the AWS S3.
| (none) | Yes if it's a S3 fileset. |
0.7.0-incubating |
-| `s3-access-key-id` | The access key of the AWS S3.
| (none) | Yes if it's a S3 fileset. |
0.7.0-incubating |
-| `s3-secret-access-key` | The secret key of the AWS S3.
| (none) | Yes if it's a S3 fileset. |
0.7.0-incubating |
-
-Please refer to [S3
credentials](./security/credential-vending.md#s3-credentials) for credential
related configurations.
-
-At the same time, you need to place the corresponding bundle jar
[`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/)
in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
-
-#### GCS fileset
-
-| Configuration item | Description | Default value | Required | Since version |
-|-------------------------------|-------------|---------------|----------|---------------|
-| `filesystem-providers` | The file system providers to add. Set it to `gs` if it's a GCS fileset, a comma separated string that contains `gs` like `gs,s3` to support multiple kinds of fileset including `gs`. | (none) | Yes | 0.7.0-incubating |
-| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No | 0.7.0-incubating |
-| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating |
-
-Please refer to [GCS credentials](./security/credential-vending.md#gcs-credentials) for credential related configurations.
-
-In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
-
-#### OSS fileset
-
-| Configuration item | Description | Default value | Required | Since version |
-|-------------------------------|-------------|---------------|----------|---------------|
-| `filesystem-providers` | The file system providers to add. Set it to `oss` if it's a OSS fileset, or a comma separated string that contains `oss` like `oss,gs,s3` to support multiple kinds of fileset including `oss`. | (none) | Yes | 0.7.0-incubating |
-| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for OSS, if we set this value, we can omit the prefix 'oss://' in the location. | `builtin-local` | No | 0.7.0-incubating |
-| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-
-Please refer to [OSS credentials](./security/credential-vending.md#oss-credentials) for credential related configurations.
-
-In the meantime, you need to place the corresponding bundle jar [`gravitino-aliyun-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
-
-
-#### Azure Blob Storage fileset
-
-| Configuration item | Description | Default value | Required | Since version |
-|-----------------------------------|-------------|---------------|----------|---------------|
-| `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating |
-| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating |
-| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
-| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
-
-Please refer to [ADLS credentials](./security/credential-vending.md#adls-credentials) for credential related configurations.
-
-Similar to the above, you need to place the corresponding bundle jar [`gravitino-azure-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
-
-:::note
-- Gravitino contains builtin file system providers for local file system(`builtin-local`) and HDFS(`builtin-hdfs`), that is to say if `filesystem-providers` is not set, Gravitino will still support local file system and HDFS. Apart from that, you can set the `filesystem-providers` to support other file systems like S3, GCS, OSS or custom file system.
-- `default-filesystem-provider` is used to set the default file system provider for the Hadoop catalog. If the user does not specify the scheme in the URI, Gravitino will use the default file system provider to access the fileset. For example, if the default file system provider is set to `builtin-local`, the user can omit the prefix `file:///` in the location.
-:::
+### Hadoop catalog with Cloud Storage
+- For S3, please refer to [Hadoop-catalog-with-s3](./hadoop-catalog-with-s3.md) for more details.
+- For GCS, please refer to [Hadoop-catalog-with-gcs](./hadoop-catalog-with-gcs.md) for more details.
+- For OSS, please refer to [Hadoop-catalog-with-oss](./hadoop-catalog-with-oss.md) for more details.
+- For Azure Blob Storage, please refer to [Hadoop-catalog-with-adls](./hadoop-catalog-with-adls.md) for more details.
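For orientation before diving into those pages, the catalog configuration described in the (removed) tables above is just a flat property map. A minimal sketch, assuming the S3 property names from this document; the builder function and all values here are hypothetical placeholders, not a Gravitino API:

```python
# Illustrative only: assemble the property map for an S3-backed Hadoop catalog.
# The property names (filesystem-providers, s3-endpoint, ...) come from the
# tables above; this helper and every value in it are hypothetical.

REQUIRED_S3_PROPS = {"s3-endpoint", "s3-access-key-id", "s3-secret-access-key"}

def build_s3_catalog_properties(location, endpoint, access_key, secret_key):
    """Return the catalog-level properties an S3 fileset catalog expects."""
    return {
        "location": location,          # e.g. s3a://bucket/root
        "filesystem-providers": "s3",  # enable the S3 file system provider
        "s3-endpoint": endpoint,
        "s3-access-key-id": access_key,
        "s3-secret-access-key": secret_key,
    }

props = build_s3_catalog_properties(
    "s3a://bucket/root",
    "http://s3.ap-northeast-1.amazonaws.com",
    "access_key",
    "secret_key",
)
assert REQUIRED_S3_PROPS.issubset(props)
```

Such a map is what the create-catalog REST/Java/Python calls shown later in this commit pass as `properties`.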
-#### How to custom your own HCFS file system fileset?
+### How to custom your own HCFS file system fileset?
Developers and users can custom their own HCFS file system fileset by implementing the `FileSystemProvider` interface in the jar [gravitino-catalog-hadoop](https://repo1.maven.org/maven2/org/apache/gravitino/catalog-hadoop/). The `FileSystemProvider` interface is defined as follows:
diff --git a/docs/how-to-use-gvfs.md b/docs/how-to-use-gvfs.md
index aff3b74adf..cbbb67dd37 100644
--- a/docs/how-to-use-gvfs.md
+++ b/docs/how-to-use-gvfs.md
@@ -42,7 +42,9 @@ the path mapping and convert automatically.
### Prerequisites
-+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
+ - GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
### Configuration
@@ -64,55 +66,8 @@ the path mapping and convert automatically.
| `fs.gravitino.fileset.cache.evictionMillsAfterAccess` | The value of time that the cache expires after accessing in the Gravitino Virtual File System. The value is in `milliseconds`. | `3600000` | No | 0.5.0 |
-Apart from the above properties, to access fileset like S3, GCS, OSS and custom fileset, you need to configure the following extra properties.
-
-#### S3 fileset
-
-| Configuration item | Description | Default value | Required | Since version |
-|------------------------|-------------------------------|---------------|---------------------------|------------------|
-| `s3-endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
-| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
-| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
-
-At the same time, you need to add the corresponding bundle jar
-1. [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the classpath if no hadoop environment is available, or
-2. [`gravitino-aws-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/) and hadoop-aws jar and other necessary dependencies in the classpath.
-
-
-#### GCS fileset
-
-| Configuration item | Description | Default value | Required | Since version |
-|----------------------------|--------------------------------------------|---------------|----------------------------|------------------|
-| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating |
-
-In the meantime, you need to add the corresponding bundle jar
-1. [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the classpath if no hadoop environment is available, or
-2. [`gravitino-gcp-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp/) and [gcs-connector jar](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) and other necessary dependencies in the classpath.
-
-
-#### OSS fileset
-
-| Configuration item | Description | Default value | Required | Since version |
-|-------------------------|-----------------------------------|---------------|----------------------------|------------------|
-| `oss-endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-| `oss-access-key-id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-| `oss-secret-access-key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-
-In the meantime, you need to place the corresponding bundle jar
-1. [`gravitino-aliyun-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun-bundle/) in the classpath if no hadoop environment is available, or
-2. [`gravitino-aliyun-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aliyun/) and hadoop-aliyun jar and other necessary dependencies in the classpath.
-
-#### Azure Blob Storage fileset
-
-| Configuration item | Description | Default value | Required | Since version |
-|------------------------------|-----------------------------------------|---------------|-------------------------------------------|------------------|
-| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
-| `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
-
-Similar to the above, you need to place the corresponding bundle jar
-1. [`gravitino-azure-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure-bundle/) in the classpath if no hadoop environment is available, or
-2. [`gravitino-azure-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-azure/) and hadoop-azure jar and other necessary dependencies in the classpath.
+Apart from the above properties, to access filesets like S3, GCS, OSS and custom filesets, extra properties are needed; please see [S3 GVFS Java client configurations](./hadoop-catalog-with-s3.md#using-the-gvfs-java-client-to-access-the-fileset), [GCS GVFS Java client configurations](./hadoop-catalog-with-gcs.md#using-the-gvfs-java-client-to-access-the-fileset), [OSS GVFS Java client configurations](./hadoop-catalog-with-oss.md#using-the-gvfs-java-client-to-access-the-fileset) and [Azure Blob Storage GVFS Java client configurations](./hadoop-catalog-with-adls.md#using-the-gvfs-java-client-to-access-the-fileset) for [...]
#### Custom fileset
Since 0.7.0-incubating, users can define their own fileset type and configure the corresponding properties, for more, please refer to [Custom Fileset](./hadoop-catalog.md#how-to-custom-your-own-hcfs-file-system-fileset).
@@ -132,26 +87,10 @@ You can configure these properties in two ways:
conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri","http://localhost:8090");
conf.set("fs.gravitino.client.metalake","test_metalake");
-
-  // Optional. It's only for S3 catalog. For GCS and OSS catalog, you should set the corresponding properties.
- conf.set("s3-endpoint", "http://localhost:9000");
- conf.set("s3-access-key-id", "minio");
- conf.set("s3-secret-access-key", "minio123");
-
Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset_1");
FileSystem fs = filesetPath.getFileSystem(conf);
```
-:::note
-If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment.
-For example, if you want to access the S3 fileset, you need to place
-1. The aws hadoop bundle jar [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/)
-2. or [`gravitino-aws-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws/), and hadoop-aws jar and other necessary dependencies
-
-to the classpath, it typically locates in `${HADOOP_HOME}/share/hadoop/common/lib/`).
-
-:::
-
2. Configure the properties in the `core-site.xml` file of the Hadoop environment:
```xml
@@ -174,20 +113,6 @@ to the classpath, it typically locates in `${HADOOP_HOME}/share/hadoop/common/li
<name>fs.gravitino.client.metalake</name>
<value>test_metalake</value>
</property>
-
-  <!-- Optional. It's only for S3 catalog. For GCs and OSS catalog, you should set the corresponding properties. -->
- <property>
- <name>s3-endpoint</name>
- <value>http://localhost:9000</value>
- </property>
- <property>
- <name>s3-access-key-id</name>
- <value>minio</value>
- </property>
- <property>
- <name>s3-secret-access-key</name>
- <value>minio123</value>
- </property>
```
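The `core-site.xml` route carries the same name/value pairs as the programmatic `conf.set(...)` route. As a rough sketch (the rendering helper below is mine, not part of Gravitino; the property names come from the examples above), the required entries can be generated like this:

```python
# Render GVFS client settings as core-site.xml <property> blocks.
# The property names come from the examples above; this renderer is a
# hypothetical convenience, not a Gravitino or Hadoop API.

def render_core_site_properties(settings):
    blocks = []
    for name, value in settings.items():
        blocks.append(
            "<property>\n"
            f"  <name>{name}</name>\n"
            f"  <value>{value}</value>\n"
            "</property>"
        )
    return "\n".join(blocks)

xml = render_core_site_properties({
    "fs.gvfs.impl":
        "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem",
    "fs.gravitino.server.uri": "http://localhost:8090",
    "fs.gravitino.client.metalake": "test_metalake",
})
assert "<name>fs.gravitino.server.uri</name>" in xml
```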
### Usage examples
@@ -223,12 +148,6 @@ cp gravitino-filesystem-hadoop3-runtime-{version}.jar ${HADOOP_HOME}/share/hadoo
# You need to ensure that the Kerberos has permission on the HDFS directory.
kinit -kt your_kerberos.keytab [email protected]
-
-# 4. Copy other dependencies to the Hadoop environment if you want to access the S3 fileset via GVFS
-cp bundles/aws-bundle/build/libs/gravitino-aws-bundle-{version}.jar ${HADOOP_HOME}/share/hadoop/common/lib/
-cp clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-{version}-SNAPSHOT.jar ${HADOOP_HOME}/share/hadoop/common/lib/
-cp ${HADOOP_HOME}/share/hadoop/tools/lib/* ${HADOOP_HOME}/share/hadoop/common/lib/
-
# 4. Try to list the fileset
./${HADOOP_HOME}/bin/hadoop dfs -ls gvfs://fileset/test_catalog/test_schema/test_fileset_1
```
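Every virtual path in these examples follows the `gvfs://fileset/<catalog>/<schema>/<fileset>/<sub-path>` layout that GVFS maps to physical locations. A small illustrative sketch of that mapping (the parser below is mine; the GVFS client performs this conversion internally):

```python
# Split a GVFS virtual path into its catalog/schema/fileset coordinates.
# Illustrative helper only; the GVFS client does this mapping itself.

def parse_gvfs_path(path):
    prefix = "gvfs://fileset/"
    if not path.startswith(prefix):
        raise ValueError(f"not a gvfs fileset path: {path}")
    parts = path[len(prefix):].split("/")
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError("expected catalog/schema/fileset after the prefix")
    catalog, schema, fileset = parts[:3]
    sub_path = "/".join(parts[3:])  # path inside the fileset; may be empty
    return catalog, schema, fileset, sub_path

assert parse_gvfs_path("gvfs://fileset/test_catalog/test_schema/test_fileset_1") == (
    "test_catalog", "test_schema", "test_fileset_1", "")
```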
@@ -239,36 +158,6 @@ You can also perform operations on the files or directories managed by fileset t
Make sure that your code is using the correct Hadoop environment, and that your environment has the `gravitino-filesystem-hadoop3-runtime-{version}.jar` dependency.
-```xml
-
-<dependency>
- <groupId>org.apache.gravitino</groupId>
- <artifactId>filesystem-hadoop3-runtime</artifactId>
- <version>{gravitino-version}</version>
-</dependency>
-
-<!-- Use the following one if there is not hadoop environment -->
-<dependency>
- <groupId>org.apache.gravitino</groupId>
- <artifactId>gravitino-aws-bundle</artifactId>
- <version>{gravitino-version}</version>
-</dependency>
-
-<!-- Use the following one if there already have hadoop environment -->
-<dependency>
- <groupId>org.apache.gravitino</groupId>
- <artifactId>gravitino-aws</artifactId>
- <version>{gravitino-version}</version>
-</dependency>
-
-<dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-aws</artifactId>
- <version>{hadoop-version}</version>
-</dependency>
-
-```
-
For example:
```java
@@ -321,7 +210,6 @@ fs.getFileStatus(filesetPath);
rdd.foreach(println)
```
-
#### Via Tensorflow
For Tensorflow to support GVFS, you need to recompile the [tensorflow-io](https://github.com/tensorflow/io) module.
@@ -468,61 +356,14 @@ to recompile the native libraries like `libhdfs` and others, and completely repl
| `oauth2_scope` | The auth scope for the Gravitino client when using `oauth2` auth type with the Gravitino Virtual File System. | (none) | Yes if you use `oauth2` auth type | 0.7.0-incubating |
| `credential_expiration_ratio` | The ratio of expiration time for credential from Gravitino. This is used in the cases where Gravitino Hadoop catalogs have enabled credential vending. If the expiration time of a credential fetched from Gravitino is 1 hour, the GVFS client will try to refresh the credential after 1 * 0.5 = 0.5 hour. | 0.5 | No | 0.8.0-incubating |
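The refresh arithmetic for `credential_expiration_ratio` can be sketched in a couple of lines (the helper name is hypothetical; the GVFS client applies this scaling internally):

```python
# Refresh delay for a vended credential: the credential lifetime scaled by
# the configured credential_expiration_ratio (default 0.5 per the table
# above). Helper name is hypothetical, for illustration only.

def refresh_after_seconds(lifetime_seconds, ratio=0.5):
    return lifetime_seconds * ratio

# A credential valid for one hour is refreshed after half an hour by default.
assert refresh_after_seconds(3600) == 1800.0
```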
+#### Configurations for S3, GCS, OSS and Azure Blob storage fileset
-#### Extra configuration for S3, GCS, OSS fileset
-
-The following properties are required if you want to access the S3 fileset via the GVFS python client:
-
-| Configuration item | Description | Default value | Required | Since version |
-|------------------------|-------------------------------|---------------|---------------------------|------------------|
-| `s3_endpoint` | The endpoint of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
-| `s3_access_key_id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
-| `s3_secret_access_key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
-
-The following properties are required if you want to access the GCS fileset via the GVFS python client:
-
-| Configuration item | Description | Default value | Required | Since version |
-|----------------------------|--------------------------------------------|---------------|----------------------------|------------------|
-| `gcs_service_account_file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset. | 0.7.0-incubating |
-
-The following properties are required if you want to access the OSS fileset via the GVFS python client:
-
-| Configuration item | Description | Default value | Required | Since version |
-|-------------------------|-----------------------------------|---------------|----------------------------|------------------|
-| `oss_endpoint` | The endpoint of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-| `oss_access_key_id` | The access key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-| `oss_secret_access_key` | The secret key of the Aliyun OSS. | (none) | Yes if it's a OSS fileset. | 0.7.0-incubating |
-
-For Azure Blob Storage fileset, you need to configure the following properties:
-
-| Configuration item | Description | Default value | Required | Since version |
-|--------------------|----------------------------------------|---------------|-------------------------------------------|------------------|
-| `abs_account_name` | The account name of Azure Blob Storage | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
-| `abs_account_key` | The account key of Azure Blob Storage | (none) | Yes if it's a Azure Blob Storage fileset. | 0.8.0-incubating |
-
-
-You can configure these properties when obtaining the `Gravitino Virtual FileSystem` in Python like this:
-
-```python
-from gravitino import gvfs
-options = {
- "cache_size": 20,
- "cache_expired_time": 3600,
- "auth_type": "simple",
-    # Optional, the following properties are required if you want to access the S3 fileset via GVFS python client, for GCS and OSS fileset, you should set the corresponding properties.
-    "s3_endpoint": "http://localhost:9000",
-    "s3_access_key_id": "minio",
-    "s3_secret_access_key": "minio123"
-}
-fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake", options=options)
-```
+Please see the cloud-storage-specific configurations: [GCS GVFS Python client configurations](./hadoop-catalog-with-gcs.md#using-the-gvfs-python-client-to-access-a-fileset), [S3 GVFS Python client configurations](./hadoop-catalog-with-s3.md#using-the-gvfs-python-client-to-access-a-fileset), [OSS GVFS Python client configurations](./hadoop-catalog-with-oss.md#using-the-gvfs-python-client-to-access-a-fileset) and [Azure Blob Storage GVFS Python client configurations](./hadoop-catalog-with-adls.md#u [...]
:::note
-
Gravitino python client does not support [customized file systems](hadoop-catalog.md#how-to-custom-your-own-hcfs-file-system-fileset) defined by users due to the limit of `fsspec` library.
:::
-
### Usage examples
1. Make sure to obtain the Gravitino library.
diff --git a/docs/manage-fileset-metadata-using-gravitino.md b/docs/manage-fileset-metadata-using-gravitino.md
index 9d96287b56..0ff84c8346 100644
--- a/docs/manage-fileset-metadata-using-gravitino.md
+++ b/docs/manage-fileset-metadata-using-gravitino.md
@@ -15,7 +15,9 @@ filesets to manage non-tabular data like training datasets and other raw data.
Typically, a fileset is mapped to a directory on a file system like HDFS, S3, ADLS, GCS, etc.
With the fileset managed by Gravitino, the non-tabular data can be managed as assets together with
-tabular data in Gravitino in a unified way.
+tabular data in Gravitino in a unified way. The following operations will use HDFS as an example; for other
+HCFS like S3, OSS, GCS, etc., please refer to the corresponding operations [hadoop-with-s3](./hadoop-catalog-with-s3.md), [hadoop-with-oss](./hadoop-catalog-with-oss.md), [hadoop-with-gcs](./hadoop-catalog-with-gcs.md) and
+[hadoop-with-adls](./hadoop-catalog-with-adls.md).
After a fileset is created, users can easily access, manage the files/directories through the fileset's identifier, without needing to know the physical path of the managed dataset. Also, with
@@ -53,24 +55,6 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
}
}' http://localhost:8090/api/metalakes/metalake/catalogs
-# create a S3 catalog
-curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
--H "Content-Type: application/json" -d '{
- "name": "catalog",
- "type": "FILESET",
- "comment": "comment",
- "provider": "hadoop",
- "properties": {
- "location": "s3a://bucket/root",
- "s3-access-key-id": "access_key",
- "s3-secret-access-key": "secret_key",
- "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
- "filesystem-providers": "s3"
- }
-}' http://localhost:8090/api/metalakes/metalake/catalogs
-
-# For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
-# The following link about the catalog properties.
```
</TabItem>
@@ -93,25 +77,8 @@ Catalog catalog = gravitinoClient.createCatalog("catalog",
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a Hadoop fileset catalog",
properties);
-
-// create a S3 catalog
-s3Properties = ImmutableMap.<String, String>builder()
- .put("location", "s3a://bucket/root")
- .put("s3-access-key-id", "access_key")
- .put("s3-secret-access-key", "secret_key")
- .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
- .put("filesystem-providers", "s3")
- .build();
-
-Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
- Type.FILESET,
- "hadoop", // provider, Gravitino only supports "hadoop" for now.
- "This is a S3 fileset catalog",
- s3Properties);
// ...
-// For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
-// The following link about the catalog properties.
```
</TabItem>
@@ -124,23 +91,6 @@ catalog = gravitino_client.create_catalog(name="catalog",
                                          provider="hadoop",
                                          comment="This is a Hadoop fileset catalog",
                                          properties={"location": "/tmp/test1"})
-
-# create a S3 catalog
-s3_properties = {
- "location": "s3a://bucket/root",
- "s3-access-key-id": "access_key"
- "s3-secret-access-key": "secret_key",
- "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
-}
-
-s3_catalog = gravitino_client.create_catalog(name="catalog",
- type=Catalog.Type.FILESET,
- provider="hadoop",
-                                             comment="This is a S3 fileset catalog",
- properties=s3_properties)
-
-# For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
-# The following link about the catalog properties.
```
</TabItem>
@@ -371,11 +321,8 @@ The `storageLocation` is the physical location of the fileset. Users can specify
when creating a fileset, or follow the rules of the catalog/schema location if not specified.
The value of `storageLocation` depends on the configuration settings of the catalog:
-- If this is a S3 fileset catalog, the `storageLocation` should be in the format of `s3a://bucket-name/path/to/fileset`.
-- If this is an OSS fileset catalog, the `storageLocation` should be in the format of `oss://bucket-name/path/to/fileset`.
- If this is a local fileset catalog, the `storageLocation` should be in the format of `file:///path/to/fileset`.
- If this is a HDFS fileset catalog, the `storageLocation` should be in the format of `hdfs://namenode:port/path/to/fileset`.
-- If this is a GCS fileset catalog, the `storageLocation` should be in the format of `gs://bucket-name/path/to/fileset`.
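The location formats listed above differ only in their URI scheme. As a quick illustration (the helper below uses only the Python standard library and is not a Gravitino API):

```python
# Pull the scheme out of a fileset storageLocation with the stdlib.
# Illustrative only; Gravitino validates locations on its own.
from urllib.parse import urlparse

def storage_scheme(location):
    return urlparse(location).scheme

assert storage_scheme("hdfs://namenode:9000/path/to/fileset") == "hdfs"
assert storage_scheme("file:///path/to/fileset") == "file"
```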
For a `MANAGED` fileset, the storage location is: