tengqm commented on code in PR #6059:
URL: https://github.com/apache/gravitino/pull/6059#discussion_r1901392704
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into the fileset catalog classpath located at `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description | Jar file | Since Version |
+|--------------|-------------|----------|---------------|
+| Local | The local file system. | (none) | 0.5.0 |
+| HDFS | HDFS file system. | (none) | 0.5.0 |
+| S3 | AWS S3. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating |
+| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating |
+| OSS | Aliyun OSS. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage, they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the jars that contain all the necessary dependencies to access the corresponding cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar` contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the runtime.
Review Comment:
Always follow the existing spelling of established words, e.g. Hadoop,
Apache, AWS.
```suggestion
They are used when there is no Hadoop environment in the runtime.
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+
+- If there is already hadoop environment in the runtime, you can use the `gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment. Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud storage filesets:
+
+| Hadoop runtime version | S3 | GCS | OSS | ABS |
+|------------------------|----|-----|-----|-----|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar` | `gravitino-gcp-bundle-${gravitino-version}.jar` | `gravitino-aliyun-bundle-${gravitino-version}.jar` | `gravitino-azure-bundle-${gravitino-version}.jar` |
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`, `hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`, `gcs-connector-${hadoop-version}`.jar, other necessary dependencies | `gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies | `gravitino-azure-${gravitino-version}.jar`, `hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
+
+For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory.
+For `gcs-connector`, you can download it from the [GCS connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) for hadoop2 or hadoop3.
+
+If there still have some issues, please report it to the Gravitino community and create an issue.
Review Comment:
Let's make users' lives easier by providing a link.
Otherwise they have to figure out where the community and/or the repo is.
```suggestion
If there are still issues, please consider [filing an issue](...)
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset by the following sentence:
+
+
+### Create a S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key",
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+s3Properties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root")
+ .put("s3-access-key-id", "access_key")
+ .put("s3-secret-access-key", "secret_key")
+ .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+ .put("filesystem-providers", "s3")
+ .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a S3 fileset catalog",
+ s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key"
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a S3 fileset catalog",
+ properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "gs://bucket/root",
+ "gcs-service-account-file": "path_of_gcs_service_account_file",
+ "filesystem-providers": "gcs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+gcsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root")
+ .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+ .put("filesystem-providers", "gcs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a GCS fileset catalog",
+ gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+ "location": "gcs://bucket/root",
+ "gcs_service_account_file": "path_of_gcs_service_account_file"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a GCS fileset catalog",
+ properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of a GCS location should always start with `gs` for instance, `gs://bucket/root`.
+:::
+
+### Create an OSS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key",
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+ "filesystem-providers": "oss"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+ossProperties = ImmutableMap.<String, String>builder()
+ .put("location", "oss://bucket/root")
+ .put("oss-access-key-id", "access_key")
+ .put("oss-secret-access-key", "secret_key")
+ .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
+ .put("filesystem-providers", "oss")
+ .build();
+
+Catalog ossProperties = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a OSS fileset catalog",
+ ossProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+oss_properties = {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key"
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com"
+}
+
+oss_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a OSS fileset catalog",
+ properties=oss_properties)
+
+```
+
+### Create an ABS (Azure Blob Storage or ADLS) fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "abfss://container/root",
+ "abs-account-name": "The account name of the Azure Blob Storage",
+ "abs-account-key": "The account key of the Azure Blob Storage",
+ "filesystem-providers": "abs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+absProperties = ImmutableMap.<String, String>builder()
+ .put("location", "abfss://container/root")
+ .put("abs-account-name", "The account name of the Azure Blob Storage")
+ .put("abs-account-key", "The account key of the Azure Blob Storage")
+ .put("filesystem-providers", "abs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a Azure Blob storage fileset catalog",
+ absProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+abs_properties = {
+ "location": "gcs://bucket/root",
+ "abs_account_name": "The account name of the Azure Blob Storage",
+ "abs_account_key": "The account key of the Azure Blob Storage"
+}
+
+abs_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a Azure Blob Storage fileset catalog",
+ properties=abs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+note:::
+The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss` NOT `abfs`, for instance, `abfss://container/root`. Value like `abfs://container/root` is not supported.
+:::
+
+
+## Create fileset schema
+
+This part is the same for all cloud storage filesets, you can create the schema by the following sentence:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "schema",
+ "comment": "comment",
+ "properties": {
+ "location": "file:///tmp/root/schema"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+// Assuming you have just created a Hadoop catalog named `catalog`
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+  // Property "location" is optional, if specified all the managed fileset without
+ // specifying storage location will be stored under this location.
+ .put("location", "file:///tmp/root/schema")
+ .build();
+Schema schema = supportsSchemas.createSchema("schema",
+ "This is a schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+You can change the value of property `location` according to which catalog you are using, moreover, if we have set the `location` property in the catalog, we can omit the `location` property in the schema.
Review Comment:
```suggestion
You can change the `location` value based on the catalog you are using.
If the `location` property is specified in the catalog, we can omit it in
the schema.
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the
jars that contain all the necessary dependencies to access the corresponding
cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar`
contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and
`hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the
runtime.
+
+- If there is already hadoop environment in the runtime, you can use the
`gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not
contain the cloud storage classes (like hadoop-aws) and hadoop environment.
Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud
storage filesets:
+
+| Hadoop runtime version | S3
| GCS
| OSS
| ABS
|
+|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar`
| `gravitino-gcp-bundle-${gravitino-version}.jar`
|
`gravitino-aliyun-bundle-${gravitino-version}.jar`
|
`gravitino-azure-bundle-${gravitino-version}.jar`
|
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`,
`hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other
necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`,
`gcs-connector-${hadoop-version}`.jar, other necessary dependencies |
`gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar,
aliyun-sdk-java-{version} and other necessary dependencies |
`gravitino-azure-${gravitino-version}.jar`,
`hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
+
+For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar`
and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get
them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory.
+For `gcs-connector`, you can download it from the [GCS
connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases)
for hadoop2 or hadoop3.
+
+If there still have some issues, please report it to the Gravitino community
and create an issue.
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset
by the following sentence:
+
+
+### Create a S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key",
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+s3Properties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root")
+ .put("s3-access-key-id", "access_key")
+ .put("s3-secret-access-key", "secret_key")
+ .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+ .put("filesystem-providers", "s3")
+ .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a S3 fileset catalog",
+ s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key"
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a S3 fileset
catalog",
+ properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3` for AWS S3, for
instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported
due to the limitation of the hadoop-aws library.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "gs://bucket/root",
+ "gcs-service-account-file": "path_of_gcs_service_account_file",
+ "filesystem-providers": "gcs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+gcsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root")
+ .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+ .put("filesystem-providers", "gcs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a GCS fileset catalog",
+ gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+ "location": "gcs://bucket/root",
+ "gcs_service_account_file": "path_of_gcs_service_account_file"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a GCS fileset
catalog",
+ properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of a GCS location should always start with `gs` for instance,
`gs://bucket/root`.
+:::
+
+### Create an OSS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key",
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+ "filesystem-providers": "oss"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> ossProperties = ImmutableMap.<String, String>builder()
+ .put("location", "oss://bucket/root")
+ .put("oss-access-key-id", "access_key")
+ .put("oss-secret-access-key", "secret_key")
+ .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
+ .put("filesystem-providers", "oss")
+ .build();
+
+Catalog ossCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an OSS fileset catalog",
+    ossProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+oss_properties = {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key"
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com"
+}
+
+oss_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a OSS fileset
catalog",
+ properties=oss_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Create an ABS (Azure Blob Storage or ADLS) fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "abfss://container/root",
+ "abs-account-name": "The account name of the Azure Blob Storage",
+ "abs-account-key": "The account key of the Azure Blob Storage",
+ "filesystem-providers": "abs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> absProperties = ImmutableMap.<String, String>builder()
+ .put("location", "abfss://container/root")
+ .put("abs-account-name", "The account name of the Azure Blob Storage")
+ .put("abs-account-key", "The account key of the Azure Blob Storage")
+ .put("filesystem-providers", "abs")
+ .build();
+
+Catalog absCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an Azure Blob Storage fileset catalog",
+    absProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+abs_properties = {
+    "location": "abfss://container/root",
+    "abs-account-name": "The account name of the Azure Blob Storage",
+    "abs-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+}
+
+abs_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is an Azure Blob Storage fileset catalog",
+                                              properties=abs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss`, NOT `abfs`, for instance, `abfss://container/root`. A value like `abfs://container/root` is not supported.
+:::
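
The location-prefix rules above (`s3a` not `s3` for S3, `gs` for GCS, `oss` for OSS, `abfss` not `abfs` for ABS) can be sketched as a small validation helper. The function name and mapping below are illustrative only, not part of the Gravitino API:

```python
# Illustrative helper (not part of Gravitino): checks that a catalog
# "location" uses the scheme each filesystem provider expects.
EXPECTED_SCHEMES = {
    "s3": "s3a://",     # hadoop-aws only supports the s3a scheme
    "gcs": "gs://",
    "oss": "oss://",
    "abs": "abfss://",  # abfs:// is not supported
}

def check_location(provider: str, location: str) -> bool:
    """Return True if the location starts with the expected scheme."""
    return location.startswith(EXPECTED_SCHEMES[provider])

print(check_location("s3", "s3a://bucket/root"))      # True
print(check_location("s3", "s3://bucket/root"))       # False
print(check_location("abs", "abfs://container/root")) # False
```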
+
+
+## Create fileset schema
+
+This part is the same for all cloud storage filesets. You can create the schema with the following statements:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "schema",
+ "comment": "comment",
+ "properties": {
+ "location": "file:///tmp/root/schema"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+// Assuming you have just created a Hadoop catalog named `catalog`
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+  // The "location" property is optional. If specified, all managed filesets
+  // without a specified storage location will be stored under this location.
+  .put("location", "file:///tmp/root/schema")
+  .build();
+Schema schema = supportsSchemas.createSchema("schema",
+ "This is a schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+</TabItem>
+</Tabs>
+
+You can change the value of the `location` property according to which catalog you are using. Moreover, if the `location` property is set in the catalog, you can omit the `location` property in the schema.
+
+## Create filesets
+
+The following statements create a fileset in the schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": "s3a://bucket/root/schema/example_fileset",
+ "properties": {
+ "k1": "v1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+ .put("k1", "v1")
+ .build();
+
+filesetCatalog.createFileset(
+  NameIdentifier.of("schema", "example_fileset"),
+  "This is an example fileset",
+  Fileset.Type.MANAGED,
+  "s3a://bucket/root/schema/example_fileset",
+  propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example fileset",
+                                            storage_location="s3a://bucket/root/schema/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+Similar to the schema, the `storageLocation` is optional if you have set the `location` property in the schema or catalog. Please change the value of
+`location` to the actual location where you want to store the fileset.
+
+The example above is for an S3 fileset. You can replace the `storageLocation` with the actual location of a GCS, OSS, or ABS fileset.
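
The precedence described above (an explicit fileset `storageLocation`, then the schema `location`, then the catalog `location`) can be sketched as follows. The function and its exact path-joining behavior are illustrative assumptions, not Gravitino client code:

```python
# Illustrative sketch (not Gravitino code): how a managed fileset's actual
# storage location could be resolved from fileset, schema, and catalog settings.
def resolve_storage_location(catalog_location, schema_location,
                             fileset_storage_location, schema_name, fileset_name):
    if fileset_storage_location:   # an explicit fileset location wins
        return fileset_storage_location
    if schema_location:            # otherwise fall back to the schema location
        return f"{schema_location.rstrip('/')}/{fileset_name}"
    # finally fall back to the catalog location
    return f"{catalog_location.rstrip('/')}/{schema_name}/{fileset_name}"

print(resolve_storage_location("s3a://bucket/root", None, None,
                               "schema", "example_fileset"))
# s3a://bucket/root/schema/example_fileset
```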
+
+
+## Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+```python
+import logging
+from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "s3_catalog"
+schema_name = "schema"
+fileset_name = "example"
+
+## this is for S3
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("s3_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) \
+    .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) \
+    .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+## this is for GCS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("gcs_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+## this is for OSS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("oss_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) \
+    .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) \
+    .config("spark.hadoop.oss-endpoint", "https://oss-cn-shanghai.aliyuncs.com") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+spark.sparkContext.setLogLevel("DEBUG")
+
+## this is for ABS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("abs_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.azure-storage-account-name", "azure_account_name") \
+    .config("spark.hadoop.azure-storage-account-key", "azure_account_key") \
+    .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
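
The `gvfs_path` above follows the virtual filesystem naming convention `gvfs://fileset/{catalog}/{schema}/{fileset}/{sub_path}`. A small helper for building such paths (illustrative only, not part of the Gravitino client):

```python
def build_gvfs_path(catalog: str, schema: str, fileset: str, sub_path: str = "") -> str:
    """Build a Gravitino Virtual File System (gvfs) path for a fileset."""
    base = f"gvfs://fileset/{catalog}/{schema}/{fileset}"
    return f"{base}/{sub_path}" if sub_path else base

print(build_gvfs_path("s3_catalog", "schema", "example", "people"))
# gvfs://fileset/s3_catalog/schema/example/people
```

Note that the same gvfs path works regardless of the underlying cloud storage, since the actual storage location is resolved by the Gravitino server.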
+
+If your Spark environment does not have Hadoop, you can use the following code snippet to access the fileset:
+
+```python
+## replace the env PYSPARK_SUBMIT_ARGS variable in the code above with the following content:
+## S3
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+## GCS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+## OSS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+## Azure Blob Storage
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+```
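
The `--jars` value in all the variants above is just a comma-separated list of jar paths. A hypothetical helper for assembling `PYSPARK_SUBMIT_ARGS` (the function is illustrative, not part of Gravitino or PySpark) might look like:

```python
def pyspark_submit_args(jars: list[str], master: str = "local[1]") -> str:
    """Build a PYSPARK_SUBMIT_ARGS value from a list of jar paths.

    The jar list must be comma-separated with no trailing comma,
    otherwise spark-submit rejects the --jars argument.
    """
    return f"--jars {','.join(jars)} --master {master} pyspark-shell"

args = pyspark_submit_args(["/path/to/a.jar", "/path/to/b.jar"])
print(args)
# --jars /path/to/a.jar,/path/to/b.jar --master local[1] pyspark-shell
```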
+
+:::note
+In some Spark versions, a Hadoop environment is needed by the driver. Adding the bundle jars with `--jars` may not work; in this case, you should add the jars to the Spark classpath directly.
Review Comment:
You don't need to use bold fonts here since the text is already in a `note`.
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
Review Comment:
End a sentence with a full period `.`; start a new sentence with a capital
letter,
and optionally leave two spaces between two sentences.
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the
jars that contain all the necessary dependencies to access the corresponding
cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar`
contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and
`hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the
runtime.
+
+- If there is already hadoop environment in the runtime, you can use the
`gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not
contain the cloud storage classes (like hadoop-aws) and hadoop environment.
Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud
storage filesets:
+
+| Hadoop runtime version | S3
| GCS
| OSS
| ABS
|
+|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar`
| `gravitino-gcp-bundle-${gravitino-version}.jar`
|
`gravitino-aliyun-bundle-${gravitino-version}.jar`
|
`gravitino-azure-bundle-${gravitino-version}.jar`
|
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`,
`hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other
necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`,
`gcs-connector-${hadoop-version}`.jar, other necessary dependencies |
`gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar,
aliyun-sdk-java-{version} and other necessary dependencies |
`gravitino-azure-${gravitino-version}.jar`,
`hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
Review Comment:
We don't have to use tables here. Markdown sucks when used to create tables,
especially when the cell contents are long.
We can use, for example, unordered lists for this.
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
Review Comment:
We have two sentences here... and the wordy statement can be simplified
without losing anything:
```suggestion
Gravitino bundles jars are used to access the cloud storage.
They are divided into two categories:
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the
jars that contain all the necessary dependencies to access the corresponding
cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar`
contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and
`hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the
runtime.
+
+- If there is already hadoop environment in the runtime, you can use the
`gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not
contain the cloud storage classes (like hadoop-aws) and hadoop environment.
Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud
storage filesets:
+
+| Hadoop runtime version | S3
| GCS
| OSS
| ABS
|
+|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar`
| `gravitino-gcp-bundle-${gravitino-version}.jar`
|
`gravitino-aliyun-bundle-${gravitino-version}.jar`
|
`gravitino-azure-bundle-${gravitino-version}.jar`
|
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`,
`hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other
necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`,
`gcs-connector-${hadoop-version}`.jar, other necessary dependencies |
`gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar,
aliyun-sdk-java-{version} and other necessary dependencies |
`gravitino-azure-${gravitino-version}.jar`,
`hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
+
+For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar`
and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get
them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory.
+For `gcs-connector`, you can download it from the [GCS
connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases)
for hadoop2 or hadoop3.
+
+If there still have some issues, please report it to the Gravitino community
and create an issue.
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset
by the following sentence:
+
+
+### Create a S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key",
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+s3Properties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root")
+ .put("s3-access-key-id", "access_key")
+ .put("s3-secret-access-key", "secret_key")
+ .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+ .put("filesystem-providers", "s3")
+ .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a S3 fileset catalog",
+ s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key"
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a S3 fileset
catalog",
+ properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3` for AWS S3, for
instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported
due to the limitation of the hadoop-aws library.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "gs://bucket/root",
+ "gcs-service-account-file": "path_of_gcs_service_account_file",
+ "filesystem-providers": "gcs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+gcsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root")
+ .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+ .put("filesystem-providers", "gcs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a GCS fileset catalog",
+ gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+ "location": "gcs://bucket/root",
+ "gcs_service_account_file": "path_of_gcs_service_account_file"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a GCS fileset
catalog",
+ properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of a GCS location should always start with `gs` for instance,
`gs://bucket/root`.
Review Comment:
```suggestion
The prefix of a GCS location should always start with `gs`, for instance,
`gs://bucket/root`.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]