tengqm commented on code in PR #6059:
URL: https://github.com/apache/gravitino/pull/6059#discussion_r1901392704
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into the fileset catalog classpath located at `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description | Jar file | Since Version |
+|--------------|-------------|----------|---------------|
+| Local | The local file system. | (none) | 0.5.0 |
+| HDFS | HDFS file system. | (none) | 0.5.0 |
+| S3 | AWS S3. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating |
+| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating |
+| OSS | Aliyun OSS. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage, they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the jars that contain all the necessary dependencies to access the corresponding cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar` contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the runtime.
Review Comment:
Always follow the existing spelling of established words, e.g. Hadoop,
Apache, AWS.
```suggestion
They are used when there is no Hadoop environment in the runtime.
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+
+- If there is already hadoop environment in the runtime, you can use the `gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment. Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud storage filesets:
+
+| Hadoop runtime version | S3 | GCS | OSS | ABS |
+|------------------------|----|-----|-----|-----|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar` | `gravitino-gcp-bundle-${gravitino-version}.jar` | `gravitino-aliyun-bundle-${gravitino-version}.jar` | `gravitino-azure-bundle-${gravitino-version}.jar` |
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`, `hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`, `gcs-connector-${hadoop-version}`.jar, other necessary dependencies | `gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies | `gravitino-azure-${gravitino-version}.jar`, `hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
+
+For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar` and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory.
+For `gcs-connector`, you can download it from the [GCS connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases) for hadoop2 or hadoop3.
+
+If there still have some issues, please report it to the Gravitino community and create an issue.
Review Comment:
Let's make users' lives easier by providing a link.
Otherwise they have to figure out where the community and/or the repo is.
```suggestion
If there are still issues, please consider [filing an issue](...)
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset by the following sentence:
+
+
+### Create a S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key",
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+s3Properties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root")
+ .put("s3-access-key-id", "access_key")
+ .put("s3-secret-access-key", "secret_key")
+ .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+ .put("filesystem-providers", "s3")
+ .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a S3 fileset catalog",
+ s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key"
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a S3 fileset catalog",
+ properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3` for AWS S3, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported due to the limitation of the hadoop-aws library.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "gs://bucket/root",
+ "gcs-service-account-file": "path_of_gcs_service_account_file",
+ "filesystem-providers": "gcs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+gcsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root")
+ .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+ .put("filesystem-providers", "gcs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a GCS fileset catalog",
+ gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+ "location": "gcs://bucket/root",
+ "gcs_service_account_file": "path_of_gcs_service_account_file"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a GCS fileset catalog",
+ properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of a GCS location should always start with `gs` for instance, `gs://bucket/root`.
+:::
+
+### Create an OSS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key",
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+ "filesystem-providers": "oss"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+ossProperties = ImmutableMap.<String, String>builder()
+ .put("location", "oss://bucket/root")
+ .put("oss-access-key-id", "access_key")
+ .put("oss-secret-access-key", "secret_key")
+ .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
+ .put("filesystem-providers", "oss")
+ .build();
+
+Catalog ossProperties = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a OSS fileset catalog",
+ ossProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+oss_properties = {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key"
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com"
+}
+
+oss_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a OSS fileset catalog",
+ properties=oss_properties)
+
+```
+
+### Create an ABS (Azure Blob Storage or ADLS) fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "abfss://container/root",
+ "abs-account-name": "The account name of the Azure Blob Storage",
+ "abs-account-key": "The account key of the Azure Blob Storage",
+ "filesystem-providers": "abs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+absProperties = ImmutableMap.<String, String>builder()
+ .put("location", "abfss://container/root")
+ .put("abs-account-name", "The account name of the Azure Blob Storage")
+ .put("abs-account-key", "The account key of the Azure Blob Storage")
+ .put("filesystem-providers", "abs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a Azure Blob storage fileset catalog",
+ absProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+abs_properties = {
+ "location": "gcs://bucket/root",
+ "abs_account_name": "The account name of the Azure Blob Storage",
+ "abs_account_key": "The account key of the Azure Blob Storage"
+}
+
+abs_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+                                             comment="This is a Azure Blob Storage fileset catalog",
+ properties=abs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+note:::
+The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss` NOT `abfs`, for instance, `abfss://container/root`. Value like `abfs://container/root` is not supported.
+:::
+
+
+## Create fileset schema
+
+This part is the same for all cloud storage filesets, you can create the schema by the following sentence:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "schema",
+ "comment": "comment",
+ "properties": {
+ "location": "file:///tmp/root/schema"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+// Assuming you have just created a Hadoop catalog named `catalog`
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+  // Property "location" is optional, if specified all the managed fileset without
+ // specifying storage location will be stored under this location.
+ .put("location", "file:///tmp/root/schema")
+ .build();
+Schema schema = supportsSchemas.createSchema("schema",
+ "This is a schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+You can change the value of property `location` according to which catalog you are using, moreover, if we have set the `location` property in the catalog, we can omit the `location` property in the schema.
Review Comment:
```suggestion
You can change the `location` value based on the catalog you are using.
If the `location` property is specified in the catalog, we can omit it in
the schema.
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the
jars that contain all the necessary dependencies to access the corresponding
cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar`
contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and
`hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the
runtime.
+
+- If there is already hadoop environment in the runtime, you can use the
`gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not
contain the cloud storage classes (like hadoop-aws) and hadoop environment.
Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud
storage filesets:
+
+| Hadoop runtime version | S3
| GCS
| OSS
| ABS
|
+|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar`
| `gravitino-gcp-bundle-${gravitino-version}.jar`
|
`gravitino-aliyun-bundle-${gravitino-version}.jar`
|
`gravitino-azure-bundle-${gravitino-version}.jar`
|
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`,
`hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other
necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`,
`gcs-connector-${hadoop-version}`.jar, other necessary dependencies |
`gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar,
aliyun-sdk-java-{version} and other necessary dependencies |
`gravitino-azure-${gravitino-version}.jar`,
`hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
+
+For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar`
and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get
them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory.
+For `gcs-connector`, you can download it from the [GCS
connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases)
for hadoop2 or hadoop3.
+
+If there still have some issues, please report it to the Gravitino community
and create an issue.
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset
by the following sentence:
+
+
+### Create a S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key",
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+s3Properties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root")
+ .put("s3-access-key-id", "access_key")
+ .put("s3-secret-access-key", "secret_key")
+ .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+ .put("filesystem-providers", "s3")
+ .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a S3 fileset catalog",
+ s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key"
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a S3 fileset
catalog",
+ properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3` for AWS S3, for
instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported
due to the limitation of the hadoop-aws library.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "gs://bucket/root",
+ "gcs-service-account-file": "path_of_gcs_service_account_file",
+ "filesystem-providers": "gcs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+gcsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root")
+ .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+ .put("filesystem-providers", "gcs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a GCS fileset catalog",
+ gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+ "location": "gcs://bucket/root",
+ "gcs_service_account_file": "path_of_gcs_service_account_file"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a GCS fileset
catalog",
+ properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of a GCS location should always start with `gs` for instance,
`gs://bucket/root`.
+:::
+
+### Create an OSS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key",
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+ "filesystem-providers": "oss"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> ossProperties = ImmutableMap.<String, String>builder()
+ .put("location", "oss://bucket/root")
+ .put("oss-access-key-id", "access_key")
+ .put("oss-secret-access-key", "secret_key")
+ .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
+ .put("filesystem-providers", "oss")
+ .build();
+
+Catalog ossCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an OSS fileset catalog",
+    ossProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+oss_properties = {
+ "location": "oss://bucket/root",
+ "oss-access-key-id": "access_key"
+ "oss-secret-access-key": "secret_key",
+ "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com"
+}
+
+oss_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a OSS fileset
catalog",
+ properties=oss_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Create an ABS (Azure Blob Storage or ADLS) fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "abfss://container/root",
+ "abs-account-name": "The account name of the Azure Blob Storage",
+ "abs-account-key": "The account key of the Azure Blob Storage",
+ "filesystem-providers": "abs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> absProperties = ImmutableMap.<String, String>builder()
+ .put("location", "abfss://container/root")
+ .put("abs-account-name", "The account name of the Azure Blob Storage")
+ .put("abs-account-key", "The account key of the Azure Blob Storage")
+ .put("filesystem-providers", "abs")
+ .build();
+
+Catalog absCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an Azure Blob Storage fileset catalog",
+    absProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+abs_properties = {
+    "location": "abfss://container/root",
+    "abs-account-name": "The account name of the Azure Blob Storage",
+    "abs-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+}
+
+abs_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is an Azure Blob Storage fileset catalog",
+                                              properties=abs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss`, NOT `abfs`, for instance, `abfss://container/root`. A value like `abfs://container/root` is not supported.
+:::
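
The location-prefix rules above (`s3a` not `s3` for S3, `gs` for GCS, `oss` for OSS, `abfss` not `abfs` for ABS) can be sketched as a small validation helper. The function name and mapping below are illustrative only, not part of the Gravitino API:

```python
# Illustrative helper (not part of Gravitino): checks that a catalog
# "location" uses the scheme each filesystem provider expects.
EXPECTED_SCHEMES = {
    "s3": "s3a://",     # hadoop-aws only supports the s3a scheme
    "gcs": "gs://",
    "oss": "oss://",
    "abs": "abfss://",  # abfs:// is not supported
}

def check_location(provider: str, location: str) -> bool:
    """Return True if the location starts with the expected scheme."""
    return location.startswith(EXPECTED_SCHEMES[provider])

print(check_location("s3", "s3a://bucket/root"))      # True
print(check_location("s3", "s3://bucket/root"))       # False
print(check_location("abs", "abfs://container/root")) # False
```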
+
+
+## Create fileset schema
+
+This part is the same for all cloud storage filesets. You can create the schema with the following statements:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "schema",
+ "comment": "comment",
+ "properties": {
+ "location": "file:///tmp/root/schema"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+// Assuming you have just created a Hadoop catalog named `catalog`
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+  // The "location" property is optional. If specified, all managed filesets
+  // without a specified storage location will be stored under this location.
+  .put("location", "file:///tmp/root/schema")
+  .build();
+Schema schema = supportsSchemas.createSchema("schema",
+ "This is a schema",
+ schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+</TabItem>
+</Tabs>
+
+You can change the value of the `location` property according to which catalog you are using. Moreover, if the `location` property is set in the catalog, you can omit the `location` property in the schema.
+
+## Create filesets
+
+The following statements create a fileset in the schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": "s3a://bucket/root/schema/example_fileset",
+ "properties": {
+ "k1": "v1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+ .put("k1", "v1")
+ .build();
+
+filesetCatalog.createFileset(
+  NameIdentifier.of("schema", "example_fileset"),
+  "This is an example fileset",
+  Fileset.Type.MANAGED,
+  "s3a://bucket/root/schema/example_fileset",
+  propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example fileset",
+                                            storage_location="s3a://bucket/root/schema/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+Similar to the schema, the `storageLocation` is optional if you have set the `location` property in the schema or catalog. Please change the value of
+`location` to the actual location where you want to store the fileset.
+
+The example above is for an S3 fileset. You can replace the `storageLocation` with the actual location of a GCS, OSS, or ABS fileset.
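
The precedence described above (an explicit fileset `storageLocation`, then the schema `location`, then the catalog `location`) can be sketched as follows. The function and its exact path-joining behavior are illustrative assumptions, not Gravitino client code:

```python
# Illustrative sketch (not Gravitino code): how a managed fileset's actual
# storage location could be resolved from fileset, schema, and catalog settings.
def resolve_storage_location(catalog_location, schema_location,
                             fileset_storage_location, schema_name, fileset_name):
    if fileset_storage_location:   # an explicit fileset location wins
        return fileset_storage_location
    if schema_location:            # otherwise fall back to the schema location
        return f"{schema_location.rstrip('/')}/{fileset_name}"
    # finally fall back to the catalog location
    return f"{catalog_location.rstrip('/')}/{schema_name}/{fileset_name}"

print(resolve_storage_location("s3a://bucket/root", None, None,
                               "schema", "example_fileset"))
# s3a://bucket/root/schema/example_fileset
```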
+
+
+## Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+```python
+import logging
+from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "s3_catalog"
+schema_name = "schema"
+fileset_name = "example"
+
+## this is for S3
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}-SNAPSHOT.jar,/path/to/hadoop-aws-3.2.0.jar,/path/to/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("s3_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) \
+    .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) \
+    .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+## this is for GCS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("gcs_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.gcs-service-account-file", "/path/to/gcs-service-account-file.json") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+## this is for OSS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/aliyun-sdk-oss-2.8.3.jar,/path/to/hadoop-aliyun-3.2.0.jar,/path/to/jdom-1.1.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("oss_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) \
+    .config("spark.hadoop.oss-secret-access-key", os.environ["OSS_SECRET_ACCESS_KEY"]) \
+    .config("spark.hadoop.oss-endpoint", "https://oss-cn-shanghai.aliyuncs.com") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+spark.sparkContext.setLogLevel("DEBUG")
+
+## this is for ABS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("abs_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.azure-storage-account-name", "azure_account_name") \
+    .config("spark.hadoop.azure-storage-account-key", "azure_account_key") \
+    .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
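
The `gvfs_path` above follows the virtual filesystem naming convention `gvfs://fileset/{catalog}/{schema}/{fileset}/{sub_path}`. A small helper for building such paths (illustrative only, not part of the Gravitino client):

```python
def build_gvfs_path(catalog: str, schema: str, fileset: str, sub_path: str = "") -> str:
    """Build a Gravitino Virtual File System (gvfs) path for a fileset."""
    base = f"gvfs://fileset/{catalog}/{schema}/{fileset}"
    return f"{base}/{sub_path}" if sub_path else base

print(build_gvfs_path("s3_catalog", "schema", "example", "people"))
# gvfs://fileset/s3_catalog/schema/example/people
```

Note that the same gvfs path works regardless of the underlying cloud storage, since the actual storage location is resolved by the Gravitino server.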
+
+If your Spark environment does not have Hadoop, you can use the following code snippet to access the fileset:
+
+```python
+## replace the env PYSPARK_SUBMIT_ARGS variable in the code above with the following content:
+## S3
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aws-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+## GCS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+## OSS
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-aliyun-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+## Azure Blob Storage
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+```
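
The `--jars` value in all the variants above is just a comma-separated list of jar paths. A hypothetical helper for assembling `PYSPARK_SUBMIT_ARGS` (the function is illustrative, not part of Gravitino or PySpark) might look like:

```python
def pyspark_submit_args(jars: list[str], master: str = "local[1]") -> str:
    """Build a PYSPARK_SUBMIT_ARGS value from a list of jar paths.

    The jar list must be comma-separated with no trailing comma,
    otherwise spark-submit rejects the --jars argument.
    """
    return f"--jars {','.join(jars)} --master {master} pyspark-shell"

args = pyspark_submit_args(["/path/to/a.jar", "/path/to/b.jar"])
print(args)
# --jars /path/to/a.jar,/path/to/b.jar --master local[1] pyspark-shell
```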
+
+:::note
+In some Spark versions, a Hadoop environment is needed by the driver. Adding the bundle jars with `--jars` may not work; in this case, you should add the jars to the Spark classpath directly.
Review Comment:
You don't need to use bold fonts here since the text is already in a `note`.
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
Review Comment:
End a sentence with a full period `.`; start a new sentence with a capital
letter,
and optionally leave two spaces between two sentences.
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the
jars that contain all the necessary dependencies to access the corresponding
cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar`
contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and
`hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the
runtime.
+
+- If there is already hadoop environment in the runtime, you can use the
`gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not
contain the cloud storage classes (like hadoop-aws) and hadoop environment.
Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud
storage filesets:
+
+| Hadoop runtime version | S3
| GCS
| OSS
| ABS
|
+|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar`
| `gravitino-gcp-bundle-${gravitino-version}.jar`
|
`gravitino-aliyun-bundle-${gravitino-version}.jar`
|
`gravitino-azure-bundle-${gravitino-version}.jar`
|
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`,
`hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other
necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`,
`gcs-connector-${hadoop-version}`.jar, other necessary dependencies |
`gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar,
aliyun-sdk-java-{version} and other necessary dependencies |
`gravitino-azure-${gravitino-version}.jar`,
`hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
Review Comment:
We don't have to use tables here. Markdown sucks when used to create tables,
especially when the cell contents are long.
We can use, for example, unordered lists for this.
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
Review Comment:
We have two sentences here... and the wordy statement can be simplified
without losing anything:
```suggestion
Gravitino bundles jars are used to access the cloud storage.
They are divided into two categories:
```
##########
docs/cloud-storage-fileset-example.md:
##########
@@ -0,0 +1,678 @@
+---
+title: "How to use cloud storage fileset"
+slug: /how-to-use-cloud-storage-fileset
+keyword: fileset S3 GCS ADLS OSS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document aims to provide a comprehensive guide on how to use cloud
storage fileset created by Gravitino, it usually contains the following
sections:
+
+## Necessary steps in Gravitino server
+
+### Start up Gravitino server
+
+Before running the Gravitino server, you need to put the following jars into
the fileset catalog classpath located at
`${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you
need to put `gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the
fileset catalog classpath in `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
+
+| Storage type | Description
| Jar file
| Since Version |
+|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------|
+| Local | The local file system.
| (none)
| 0.5.0 |
+| HDFS | HDFS file system.
| (none)
| 0.5.0 |
+| S3 | AWS S3.
|
[gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle)
| 0.8.0-incubating |
+| GCS | Google Cloud Storage.
|
[gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle)
| 0.8.0-incubating |
+| OSS | Aliyun OSS.
|
[gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle)
| 0.8.0-incubating |
+| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2)
|
[gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle)
| 0.8.0-incubating |
+
+After adding the jars into the fileset catalog classpath, you can start up the
Gravitino server by running the following command:
+
+```shell
+cd ${GRAVITINO_HOME}
+bin/gravitino.sh start
+```
+
+### Bundle jars
+
+Gravitino bundles jars are jars that are used to access the cloud storage,
they are divided into two categories:
+
+- `gravitino-${aws,gcp,aliyun,azure}-bundle-{gravitino-version}.jar` are the
jars that contain all the necessary dependencies to access the corresponding
cloud storages. For instance, `gravitino-aws-bundle-${gravitino-version}.jar`
contains the all necessary classes including `hadoop-common`(hadoop-3.3.1) and
`hadoop-aws` to access the S3 storage.
+They are used in the scenario where there is no hadoop environment in the
runtime.
+
+- If there is already hadoop environment in the runtime, you can use the
`gravitino-${aws,gcp,aliyun,azure}-${gravitino-version}.jar` that does not
contain the cloud storage classes (like hadoop-aws) and hadoop environment.
Alternatively, you can manually add the necessary jars to the classpath.
+
+The following table demonstrates which jars are necessary for different cloud
storage filesets:
+
+| Hadoop runtime version | S3
| GCS
| OSS
| ABS
|
+|------------------------|------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
+| No Hadoop environment | `gravitino-aws-bundle-${gravitino-version}.jar`
| `gravitino-gcp-bundle-${gravitino-version}.jar`
|
`gravitino-aliyun-bundle-${gravitino-version}.jar`
|
`gravitino-azure-bundle-${gravitino-version}.jar`
|
+| 2.x, 3.x | `gravitino-aws-${gravitino-version}.jar`,
`hadoop-aws-${hadoop-version}.jar`, `aws-sdk-java-${version}` and other
necessary dependencies | `gravitino-gcp-{gravitino-version}.jar`,
`gcs-connector-${hadoop-version}`.jar, other necessary dependencies |
`gravitino-aliyun-{gravitino-version}.jar`, hadoop-aliyun-{hadoop-version}.jar,
aliyun-sdk-java-{version} and other necessary dependencies |
`gravitino-azure-${gravitino-version}.jar`,
`hadoop-azure-${hadoop-version}.jar`, and other necessary dependencies |
+
+For `hadoop-aws-${hadoop-version}.jar`, `hadoop-azure-${hadoop-version}.jar`
and `hadoop-aliyun-${hadoop-version}.jar` and related dependencies, you can get
them from `${HADOOP_HOME}/share/hadoop/tools/lib/` directory.
+For `gcs-connector`, you can download it from the [GCS
connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/releases)
for hadoop2 or hadoop3.
+
+If there still have some issues, please report it to the Gravitino community
and create an issue.
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset
by the following sentence:
+
+
+### Create a S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key",
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+ "filesystem-providers": "s3"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+s3Properties = ImmutableMap.<String, String>builder()
+ .put("location", "s3a://bucket/root")
+ .put("s3-access-key-id", "access_key")
+ .put("s3-secret-access-key", "secret_key")
+ .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+ .put("filesystem-providers", "s3")
+ .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a S3 fileset catalog",
+ s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+ "location": "s3a://bucket/root",
+ "s3-access-key-id": "access_key"
+ "s3-secret-access-key": "secret_key",
+ "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a S3 fileset
catalog",
+ properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3` for AWS S3, for
instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported
due to the limitation of the hadoop-aws library.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "location": "gs://bucket/root",
+ "gcs-service-account-file": "path_of_gcs_service_account_file",
+ "filesystem-providers": "gcs"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://localhost:8090")
+ .withMetalake("metalake")
+ .build();
+
+gcsProperties = ImmutableMap.<String, String>builder()
+ .put("location", "gs://bucket/root")
+ .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+ .put("filesystem-providers", "gcs")
+ .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+ Type.FILESET,
+ "hadoop", // provider, Gravitino only supports "hadoop" for now.
+ "This is a GCS fileset catalog",
+ gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient =
GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+ "location": "gcs://bucket/root",
+ "gcs_service_account_file": "path_of_gcs_service_account_file"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+ type=Catalog.Type.FILESET,
+ provider="hadoop",
+ comment="This is a GCS fileset
catalog",
+ properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The prefix of a GCS location should always start with `gs` for instance,
`gs://bucket/root`.
Review Comment:
```suggestion
The prefix of a GCS location should always start with `gs`, for instance,
`gs://bucket/root`.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]