tengqm commented on code in PR #7111:
URL: https://github.com/apache/gravitino/pull/7111#discussion_r2076974711
##########
docs/catalogs/fileset/gvfs/index.md:
##########
@@ -0,0 +1,1078 @@

---
title: Using Apache Gravitino Virtual File System for Filesets
slug: /how-to-use-gvfs
license: "This software is licensed under the Apache License version 2."
---

## Introduction

A *fileset* in Apache Gravitino is a conceptual, logical collection of files and directories.
In Gravitino, you can manage non-tabular data with filesets.
For more details, refer to [managing filesets using Gravitino](../../../metadata/fileset.md).

Gravitino provides a virtual file system layer called the Gravitino Virtual File System (GVFS)
for managing filesets.

* In Java, it's built on top of the Hadoop Compatible File System (HCFS) interface.
* In Python, it's built on top of the [fsspec](https://filesystem-spec.readthedocs.io/en/stable/index.html) interface.

GVFS is a virtual layer that manages the files and directories in a fileset through a virtual path,
without needing to understand the specific storage details of the fileset.
You can access the files or folders as shown below:

```text
gvfs://fileset/{catalog}/{schema}/{fileset}/sub_dir/
```

In Python, you can also access the files or folders as shown below:

```text
fileset/{catalog}/{schema}/{fileset}/sub_dir/
```

where

- `gvfs`: the scheme of the GVFS.
- `fileset`: the root directory of the GVFS; it is immutable.
- `{catalog}/{schema}/{fileset}`: the virtual path of the fileset.

You can access the files and folders under this virtual path
by appending a file or folder name to it.

The usage pattern for GVFS is the same as for HDFS or S3.
GVFS manages the path mapping internally and converts paths automatically.
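For example, with the Java client (the configuration keys are described in the next section), standard HCFS calls can be issued directly against the virtual path. The sketch below is illustrative only: the server URI, metalake, catalog, schema, fileset, and file names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GvfsQuickStart {
  public static void main(String[] args) throws Exception {
    // Minimal GVFS client configuration; see the configuration table below for all options.
    Configuration conf = new Configuration();
    conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
    conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
    conf.set("fs.gravitino.server.uri", "http://localhost:8090");
    conf.set("fs.gravitino.client.metalake", "mymetalake");

    // A virtual path; GVFS resolves it to the fileset's actual storage location.
    Path dir = new Path("gvfs://fileset/mycatalog/myschema/my_fileset_1/sub_dir");
    FileSystem fs = dir.getFileSystem(conf);

    fs.mkdirs(dir);                                  // create a sub-directory
    fs.create(new Path(dir, "example.txt")).close(); // create an empty file
    for (FileStatus status : fs.listStatus(dir)) {   // list the directory contents
      System.out.println(status.getPath());
    }
  }
}
```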
## Managing Filesets with Java GVFS

- GVFS has been tested against Hadoop 3.3.1.
  It is recommended to use Hadoop 3.3.1 or later, but it should also work with Hadoop 2.x.
  Please create an [issue](https://www.github.com/apache/gravitino/issues)
  if you find any compatibility issues.

### Java GVFS Configuration

<table>
<thead>
<tr>
  <th>Configuration item</th>
  <th>Description</th>
  <th>Default value</th>
  <th>Required</th>
  <th>Since version</th>
</tr>
</thead>
<tbody>
<tr>
  <td><tt>fs.AbstractFileSystem.gvfs.impl</tt></td>
  <td>
    The Gravitino Virtual File System (GVFS) abstract class.
    Set it to `org.apache.gravitino.filesystem.hadoop.Gvfs`.
  </td>
  <td>(none)</td>
  <td>Yes</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gvfs.impl</tt></td>
  <td>
    The Gravitino Virtual File System implementation class.
    Set it to `org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem`.
  </td>
  <td>(none)</td>
  <td>Yes</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gvfs.impl.disable.cache</tt></td>
  <td>
    Disable the Gravitino Virtual File System cache in the Hadoop environment.
    If you need to proxy multi-user operations, set this value to `true`
    and create a separate `FileSystem` instance for each user.
  </td>
  <td>`false`</td>
  <td>No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.server.uri</tt></td>
  <td>The Gravitino server URI from which GVFS loads the fileset metadata.</td>
  <td>(none)</td>
  <td>Yes</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.metalake</tt></td>
  <td>The metalake to which the fileset belongs.</td>
  <td>(none)</td>
  <td>Yes</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.authType</tt></td>
  <td>
    The authentication type used when initializing the Gravitino client
    for the Gravitino Virtual File System.
    Currently only the `simple`, `oauth2`, and `kerberos` authentication types are supported.
  </td>
  <td>`simple`</td>
  <td>No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.oauth2.serverUri</tt></td>
  <td>
    The authentication server URI for the Gravitino client
    when using `oauth2` for the Gravitino Virtual File System.

    This field is required if `oauth2` is used; otherwise it is optional.
  </td>
  <td>(none)</td>
  <td>Yes|No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.oauth2.credential</tt></td>
  <td>
    The authentication credential for the Gravitino client
    when using `oauth2` for the Gravitino Virtual File System.

    This field is required if `oauth2` is used.
  </td>
  <td>(none)</td>
  <td>Yes|No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.oauth2.path</tt></td>
  <td>
    The authentication server path for the Gravitino client
    when using `oauth2` for the Gravitino Virtual File System.
    Please remove the leading slash `/` from the path, for example `oauth/token`.

    This field is required if `oauth2` is used.
  </td>
  <td>(none)</td>
  <td>Yes|No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.oauth2.scope</tt></td>
  <td>
    The authentication scope for the Gravitino client
    when using `oauth2` with the Gravitino Virtual File System.
  </td>
  <td>(none)</td>
  <td>Yes|No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.kerberos.principal</tt></td>
  <td>
    The authentication principal for the Gravitino client
    when using `kerberos` for the Gravitino Virtual File System.

    This field is required if `kerberos` is used.
  </td>
  <td>(none)</td>
  <td>Yes|No</td>
  <td>0.5.1</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.kerberos.keytabFilePath</tt></td>
  <td>
    The authentication keytab file path for the Gravitino client
    when using `kerberos` for the Gravitino Virtual File System.
  </td>
  <td>(none)</td>
  <td>No</td>
  <td>0.5.1</td>
</tr>
<tr>
  <td><tt>fs.gravitino.fileset.cache.maxCapacity</tt></td>
  <td>The cache capacity of the Gravitino Virtual File System.</td>
  <td>`20`</td>
  <td>No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.fileset.cache.evictionMillsAfterAccess</tt></td>
  <td>
    The time after which a cache entry in the Gravitino Virtual File System expires,
    counted from its last access.

    The value is in milliseconds.
  </td>
  <td>`3600000`</td>
  <td>No</td>
  <td>0.5.0</td>
</tr>
<tr>
  <td><tt>fs.gravitino.current.location.name</tt></td>
  <td>
    The configuration used to select the location of the fileset.
    If this configuration is not set, the environment variable configured by
    `fs.gravitino.current.location.name.env.var` is checked.
    If neither is set, the value of the fileset property `default-location-name`
    is used as the location name.
  </td>
  <td>The value of the fileset property `default-location-name`</td>
  <td>No</td>
  <td>0.9.0-incubating</td>
</tr>
<tr>
  <td><tt>fs.gravitino.current.location.name.env.var</tt></td>
  <td>The name of the environment variable from which to read the current location name.</td>
  <td>`CURRENT_LOCATION_NAME`</td>
  <td>No</td>
  <td>0.9.0-incubating</td>
</tr>
<tr>
  <td><tt>fs.gravitino.operations.class</tt></td>
  <td>
    The operations class that provides the file system operations for the Gravitino Virtual File System.
    Users can extend `BaseGVFSOperations` to implement their own operations,
    and then set this configuration item to the class name to use the custom operations.
  </td>
  <td>`org.apache.gravitino.filesystem.hadoop.DefaultGVFSOperations`</td>
  <td>No</td>
  <td>0.9.0-incubating</td>
</tr>
<tr>
  <td><tt>fs.gravitino.hook.class</tt></td>
  <td>
    The hook class to inject into the Gravitino Virtual File System.
    Users can implement their own `GravitinoVirtualFileSystemHook`,
    and then set this configuration item to the class name to inject custom code.
  </td>
  <td>`org.apache.gravitino.filesystem.hadoop.NoOpHook`</td>
  <td>No</td>
  <td>0.9.0-incubating</td>
</tr>
<tr>
  <td><tt>fs.gravitino.client.request.header.</tt></td>
  <td>
    The configuration key prefix for Gravitino client request headers.
    Use it to set request headers for the Gravitino client.
  </td>
  <td>(none)</td>
  <td>No</td>
  <td>0.9.0-incubating</td>
</tr>
<tr>
  <td><tt>fs.gravitino.enableCredentialVending</tt></td>
  <td>Whether to enable credential vending for the Gravitino Virtual File System.</td>
  <td>`false`</td>
  <td>No</td>
  <td>0.9.0-incubating</td>
</tr>
</tbody>
</table>
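To illustrate how the authentication-related keys combine with the required ones, here is a minimal sketch of a client `Configuration` that authenticates with `oauth2`. The configuration keys are taken from the table above; the OAuth server URI, credential, scope, and naming values are placeholders, not defaults.

```java
Configuration conf = new Configuration();
// Required GVFS settings.
conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri", "http://localhost:8090");
conf.set("fs.gravitino.client.metalake", "mymetalake");
// OAuth2 authentication; the values below are placeholders.
conf.set("fs.gravitino.client.authType", "oauth2");
conf.set("fs.gravitino.client.oauth2.serverUri", "http://localhost:8177");
conf.set("fs.gravitino.client.oauth2.credential", "client-id:client-secret");
conf.set("fs.gravitino.client.oauth2.path", "oauth/token"); // note: no leading slash
conf.set("fs.gravitino.client.oauth2.scope", "custom_scope");

Path filesetPath = new Path("gvfs://fileset/mycatalog/myschema/my_fileset_1");
FileSystem fs = filesetPath.getFileSystem(conf);
```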
In addition to the above properties, extra properties are needed to access filesets stored on
S3, GCS, OSS, Azure Blob Storage, or a custom file system. For more information, please see:

- [S3 GVFS Java client configurations](../hadoop/s3.md#using-the-gvfs-java-client-to-access-the-fileset)
- [GCS GVFS Java client configurations](../hadoop/gcs.md#using-the-gvfs-java-client-to-access-the-fileset)
- [OSS GVFS Java client configurations](../hadoop/oss.md#using-the-gvfs-java-client-to-access-the-fileset)
- [Azure Blob Storage GVFS Java client configurations](../hadoop/adls.md#using-the-gvfs-java-client-to-access-the-fileset)

#### Custom fileset

Since *0.7.0-incubating*, users can define their own fileset type and configure the corresponding properties.
For more details, please refer to [Custom Fileset](../hadoop/hadoop-catalog.md#how-to-custom-your-own-hcfs-file-system-fileset).
If you want to access a custom fileset through GVFS, you need to configure the corresponding properties,
as sketched after the table below.

<table>
<thead>
<tr>
  <th>Configuration item</th>
  <th>Description</th>
  <th>Default value</th>
  <th>Required</th>
  <th>Since version</th>
</tr>
</thead>
<tbody>
<tr>
  <td><tt>your-custom-properties</tt></td>
  <td>
    The properties used to create a FileSystem instance
    in `CustomFileSystemProvider#getFileSystem`.
  </td>
  <td>(none)</td>
  <td>No</td>
  <td></td>
</tr>
</tbody>
</table>
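As a sketch of how such provider-specific properties are supplied, the snippet below sets a hypothetical key alongside the standard GVFS settings; per the table above, these properties are made available when `CustomFileSystemProvider#getFileSystem` creates the underlying FileSystem. The key `my-custom-property`, its value, and the fileset name are purely illustrative; substitute whatever keys your provider implementation actually reads, using either of the two configuration approaches described next.

```java
Configuration conf = new Configuration();
conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
conf.set("fs.gravitino.server.uri", "http://localhost:8090");
conf.set("fs.gravitino.client.metalake", "mymetalake");
// Hypothetical provider-specific property, consumed by your
// CustomFileSystemProvider#getFileSystem implementation.
conf.set("my-custom-property", "my-custom-value");

Path filesetPath = new Path("gvfs://fileset/mycatalog/myschema/my_custom_fileset");
FileSystem fs = filesetPath.getFileSystem(conf);
```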
You can configure these properties in two ways:

1. Before obtaining the `FileSystem` in the code, construct a `Configuration` object and set its properties:

   ```java
   Configuration conf = new Configuration();
   conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
   conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
   conf.set("fs.gravitino.server.uri", "http://localhost:8090");
   conf.set("fs.gravitino.client.metalake", "mymetalake");
   Path filesetPath = new Path("gvfs://fileset/mycatalog/myschema/my_fileset_1");
   FileSystem fs = filesetPath.getFileSystem(conf);
   ```

1. Configure the properties in the `core-site.xml` file for the Hadoop environment:

Review Comment:
   In Markdown, an ordered list doesn't have to have an explicit number. The general practice is to let the renderer figure out which one should be number 2 and which one is number 3.

   The benefit is that we, as the docs writers, don't need to manually maintain the order of the list. In other words, if you want to insert an item into a list, you focus on that new item without worrying about the numbers after the one you insert. Similarly, when you remove an unnecessary item from a list, you don't need to change other items. The markdown parser (a correctly working one of course) will take care of it.

   Hope this solves your concern.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@gravitino.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org