Re: [PR] [#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. [gravitino]

via GitHub Mon, 06 Jan 2025 04:22:44 -0800


yuqi1129 commented on code in PR #6059:
URL: https://github.com/apache/gravitino/pull/6059#discussion_r1903623224



##########
docs/hadoop-catalog-with-gcs.md:
##########
@@ -0,0 +1,413 @@
+---
+title: "Hadoop catalog with GCS"
+slug: /hadoop-catalog-with-gcs
+date: 2024-01-03
+keyword: Hadoop catalog GCS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with GCS.
+
+## Prerequisites
+
+In order to create a Hadoop catalog with GCS, you need to place [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) in Gravitino Hadoop classpath located
+at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command:
+
+```bash
+$ bin/gravitino-server.sh start
+```
+
+## Create a Hadoop Catalog with GCS in Gravitino
+
+### Create a catalog
+
+Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS:
+
+| Configuration item            | Description | Default value   | Required                   | Since version    |
+|-------------------------------|-------------|-----------------|----------------------------|------------------|
+| `filesystem-providers`        | The file system providers to add. Set it to `gs` if it's a GCS fileset, a comma separated string that contains `gs` like `gs,s3` to support multiple kinds of fileset including `gs`. | (none)          | Yes                        | 0.7.0-incubating |
+| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No                         | 0.7.0-incubating |
+| `gcs-service-account-file`    | The path of GCS service account JSON file. | (none)          | Yes if it's a GCS fileset. | 0.7.0-incubating |
+
+### Create a schema
+
+Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
+
+### Create a fileset
+
+Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
+
+
+## Using Hadoop catalog with GCS
+
+### Create a Hadoop catalog/schema/file set with GCS
+
+First, you need to create a Hadoop catalog with GCS. The following example shows how to create a Hadoop catalog with GCS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",

Review Comment:
   OK



##########
docs/hadoop-catalog-with-gcs.md:
##########
@@ -0,0 +1,413 @@
+---
+title: "Hadoop catalog with GCS"
+slug: /hadoop-catalog-with-gcs
+date: 2024-01-03
+keyword: Hadoop catalog GCS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with GCS.
+
+## Prerequisites
+
+In order to create a Hadoop catalog with GCS, you need to place [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) in Gravitino Hadoop classpath located
+at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command:
+
+```bash
+$ bin/gravitino-server.sh start
+```
+
+## Create a Hadoop Catalog with GCS in Gravitino
+
+### Create a catalog
+
+Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS:
+
+| Configuration item            | Description | Default value   | Required                   | Since version    |
+|-------------------------------|-------------|-----------------|----------------------------|------------------|
+| `filesystem-providers`        | The file system providers to add. Set it to `gs` if it's a GCS fileset, a comma separated string that contains `gs` like `gs,s3` to support multiple kinds of fileset including `gs`. | (none)          | Yes                        | 0.7.0-incubating |
+| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No                         | 0.7.0-incubating |
+| `gcs-service-account-file`    | The path of GCS service account JSON file. | (none)          | Yes if it's a GCS fileset. | 0.7.0-incubating |
+
+### Create a schema
+
+Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
+
+### Create a fileset
+
+Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
+
+
+## Using Hadoop catalog with GCS
+
+### Create a Hadoop catalog/schema/file set with GCS
+
+First, you need to create a Hadoop catalog with GCS. The following example shows how to create a Hadoop catalog with GCS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"

Review Comment:
   It should be `gcs`, I will correct it.
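
For reference, the request body from the hunk above can be assembled and sanity-checked in Python before it is sent. This is only a minimal sketch and not part of the PR: the helper name `build_gcs_catalog_payload` is hypothetical, and the provider value `gcs` follows this comment and the examples in the hunk.

```python
import json


def build_gcs_catalog_payload(name, location, service_account_file):
    """Hypothetical helper: builds the catalog-creation body shown in the curl example."""
    return {
        "name": name,
        "type": "FILESET",
        "comment": "comment",
        "provider": "hadoop",
        "properties": {
            "location": location,
            "gcs-service-account-file": service_account_file,
            # Provider name as stated in this review comment.
            "filesystem-providers": "gcs",
        },
    }


payload = build_gcs_catalog_payload(
    "catalog", "gs://bucket/root", "path_of_gcs_service_account_file"
)
# Serialize to the JSON string that would be passed to `curl -d`.
body = json.dumps(payload, indent=2)
print(body)
```

Building the body in one place keeps the provider name consistent across the shell, Java, and Python examples.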



##########
docs/hadoop-catalog-with-gcs.md:
##########
@@ -0,0 +1,413 @@
+---
+title: "Hadoop catalog with GCS"
+slug: /hadoop-catalog-with-gcs
+date: 2024-01-03
+keyword: Hadoop catalog GCS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with GCS.
+
+## Prerequisites
+
+In order to create a Hadoop catalog with GCS, you need to place [`gravitino-gcp-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-gcp-bundle) in Gravitino Hadoop classpath located
+at `${HADOOP_HOME}/share/hadoop/common/lib/`. After that, start Gravitino server with the following command:
+
+```bash
+$ bin/gravitino-server.sh start
+```
+
+## Create a Hadoop Catalog with GCS in Gravitino
+
+### Create a catalog
+
+Apart from configuration method in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with GCS:
+
+| Configuration item            | Description | Default value   | Required                   | Since version    |
+|-------------------------------|-------------|-----------------|----------------------------|------------------|
+| `filesystem-providers`        | The file system providers to add. Set it to `gs` if it's a GCS fileset, a comma separated string that contains `gs` like `gs,s3` to support multiple kinds of fileset including `gs`. | (none)          | Yes                        | 0.7.0-incubating |
+| `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for GCS, if we set this value, we can omit the prefix 'gs://' in the location. | `builtin-local` | No                         | 0.7.0-incubating |
+| `gcs-service-account-file`    | The path of GCS service account JSON file. | (none)          | Yes if it's a GCS fileset. | 0.7.0-incubating |
+
+### Create a schema
+
+Refer to [Schema operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for more details.
+
+### Create a fileset
+
+Refer to [Fileset operation](./manage-fileset-metadata-using-gravitino.md#fileset-operations) for more details.
+
+
+## Using Hadoop catalog with GCS
+
+### Create a Hadoop catalog/schema/file set with GCS
+
+First, you need to create a Hadoop catalog with GCS. The following example shows how to create a Hadoop catalog with GCS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> gcsProperties = ImmutableMap.<String, String>builder()
+    .put("location", "gs://bucket/root")
+    .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+    .put("filesystem-providers", "gcs")
+    .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is a GCS fileset catalog",
+    gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+gcs_properties = {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"
+}
+
+gcs_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is a GCS fileset catalog",
+                                              properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+Then create a schema and fileset in the catalog created above.
+
+Use the following code to create a schema and fileset:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "schema",
+  "comment": "comment",
+  "properties": {
+    "location": "gs://bucket/root/schema"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+// Assuming you have just created a Hadoop catalog named `catalog`
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+    .put("location", "gs://bucket/root/schema")
+    .build();
+Schema schema = supportsSchemas.createSchema("schema",
+    "This is a schema",
+    schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="catalog")
+catalog.as_schemas().create_schema(name="schema",
+                                   comment="This is a schema",
+                                   properties={"location": "gs://bucket/root/schema"})
+```
+
+</TabItem>
+</Tabs>
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "example_fileset",
+  "comment": "This is an example fileset",
+  "type": "MANAGED",
+  "storageLocation": "gs://bucket/root/schema/example_fileset",
+  "properties": {
+    "k1": "v1"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+        .put("k1", "v1")
+        .build();
+
+filesetCatalog.createFileset(
+  NameIdentifier.of("schema", "example_fileset"),
+  "This is an example fileset",
+  Fileset.Type.MANAGED,
+  "gs://bucket/root/schema/example_fileset",
+  propertiesMap
+);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example fileset",
+                                            storage_location="gs://bucket/root/schema/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+## Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with Hadoop environment(Hadoop 3.2.0)** to access the fileset:
+
+```python
+import logging
+from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090"
+metalake_name = "test"
+
+catalog_name = "your_gcs_catalog"
+schema_name = "your_gcs_schema"
+fileset_name = "your_gcs_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-gcp-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder
+.appName("gcs_fileset_test")
+.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs")
+.config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")

Review Comment:
   The values of these two configurations are constant by design; users need to set them explicitly.
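
A minimal sketch of what setting both values explicitly could look like, assuming a builder object that exposes `config(key, value)` the way `SparkSession.builder` does. The names `GVFS_SPARK_CONF`, `apply_gvfs_conf`, and `RecordingBuilder` are hypothetical and used only for illustration; the keys and class names are the ones quoted in the hunk above.

```python
# Keys and values copied from the hunk above; they are constants, not tunables.
GVFS_SPARK_CONF = {
    "spark.hadoop.fs.AbstractFileSystem.gvfs.impl":
        "org.apache.gravitino.filesystem.hadoop.Gvfs",
    "spark.hadoop.fs.gvfs.impl":
        "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem",
}


def apply_gvfs_conf(builder):
    """Hypothetical helper: set both constants on any builder with config(key, value)."""
    for key, value in GVFS_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder


class RecordingBuilder:
    """Stand-in for SparkSession.builder so the sketch runs without Spark installed."""

    def __init__(self):
        self.settings = {}

    def config(self, key, value):
        # Record the setting and return self to allow chaining, like Spark's builder.
        self.settings[key] = value
        return self


builder = apply_gvfs_conf(RecordingBuilder())
print(builder.settings)
```

With PySpark available, passing `SparkSession.builder` in place of the stand-in should have the same effect as the two `.config(...)` calls in the document.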



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
