This is an automated email from the ASF dual-hosted git repository.
jshao pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/gravitino.git
The following commit(s) were added to refs/heads/main by this push:
new 14222171d [#3764] improvement(docs): Add user docs for using GVFS in
Python (#3931)
14222171d is described below
commit 14222171d4be88be9b4b88471140316123751248
Author: xloya <[email protected]>
AuthorDate: Wed Jul 10 11:06:07 2024 +0800
[#3764] improvement(docs): Add user docs for using GVFS in Python (#3931)
### What changes were proposed in this pull request?
Provides documentation for users to use Gravitino Virtual FileSystem in
Python.
### Why are the changes needed?
Fix: #3764
### How was this patch tested?
No code changes, no testing required.
---------
Co-authored-by: xiaojiebao <[email protected]>
---
docs/how-to-use-gvfs.md | 251 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 237 insertions(+), 14 deletions(-)
diff --git a/docs/how-to-use-gvfs.md b/docs/how-to-use-gvfs.md
index 654c90387..46e0c1b60 100644
--- a/docs/how-to-use-gvfs.md
+++ b/docs/how-to-use-gvfs.md
@@ -11,8 +11,10 @@ directories, with `fileset` you can manage non-tabular data through Gravitino. For
details, you can read [How to manage fileset metadata using Gravitino](./manage-fileset-metadata-using-gravitino.md).
To use `Fileset` managed by Gravitino, Gravitino provides a virtual file
system layer called
-the Gravitino Virtual File System (GVFS) that's built on top of the Hadoop
Compatible File System
-(HCFS) interface.
+the Gravitino Virtual File System (GVFS):
+* In Java, it's built on top of the Hadoop Compatible File System (HCFS) interface.
+* In Python, it's built on top of the [fsspec](https://filesystem-spec.readthedocs.io/en/stable/index.html) interface.
GVFS is a virtual layer that manages the files and directories in the fileset
through a virtual
path, without needing to understand the specific storage details of the
fileset. You can access
@@ -22,6 +24,12 @@ the files or folders as shown below:
gvfs://fileset/${catalog_name}/${schema_name}/${fileset_name}/sub_dir/
```
+In Python GVFS, you can also access the files or folders as shown below:
+
+```text
+fileset/${catalog_name}/${schema_name}/${fileset_name}/sub_dir/
+```
+
Here `gvfs` is the scheme of the GVFS, `fileset` is the root directory of the GVFS which can't be
modified, and `${catalog_name}/${schema_name}/${fileset_name}` is the virtual path of the fileset.
You can access the files and folders under this virtual path by concatenating
a file or folder
@@ -30,14 +38,16 @@ name to the virtual path.
The usage pattern for GVFS is the same as HDFS or S3. GVFS internally manages
the path mapping and converts it automatically.
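To make the mapping concrete, a virtual path can be split into its components before GVFS resolves it against the server. The helper below is a hypothetical sketch for illustration only, not part of the GVFS API; it accepts both the `gvfs://fileset/...` form and the scheme-less `fileset/...` form:

```python
from urllib.parse import urlparse

def parse_virtual_path(path: str):
    """Split a GVFS virtual path into (catalog, schema, fileset, sub_path).

    Hypothetical helper for illustration; GVFS performs this resolution
    internally.
    """
    if path.startswith("gvfs://"):
        parsed = urlparse(path)
        if parsed.netloc != "fileset":
            raise ValueError("the root directory must be 'fileset'")
        parts = parsed.path.strip("/").split("/")
    else:
        parts = path.strip("/").split("/")
        if not parts or parts[0] != "fileset":
            raise ValueError("the root directory must be 'fileset'")
        parts = parts[1:]
    if len(parts) < 3:
        raise ValueError("expected catalog/schema/fileset in the path")
    catalog, schema, fileset = parts[:3]
    sub_path = "/".join(parts[3:])
    return catalog, schema, fileset, sub_path

print(parse_virtual_path(
    "gvfs://fileset/test_catalog/test_schema/test_fileset_1/sub_dir/a.txt"))
```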
-## Prerequisites
+## 1. Managing files of Fileset with Java GVFS
+
+### Prerequisites
+ A Hadoop environment with HDFS running. GVFS has been tested against
Hadoop 3.1.0. It is recommended to use Hadoop 3.1.0 or later, but it should
work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues)
if you find any compatibility issues.
-## Configuration
+### Configuration
| Configuration item | Description | Default value | Required | Since version |
|--------------------|-------------|---------------|----------|---------------|
@@ -94,7 +104,7 @@ You can configure these properties in two ways:
</property>
```
-## How to use the Apache Gravitino Virtual File System
+### Usage examples
First make sure to obtain the Gravitino Virtual File System runtime jar, which
you can get in
two ways:
@@ -111,7 +121,7 @@ two ways:
./gradlew :clients:filesystem-hadoop3-runtime:build -x test
```
-### Use GVFS via Hadoop shell command
+#### Via Hadoop shell command
You can use the Hadoop shell command to perform operations on the fileset
storage. For example:
@@ -131,7 +141,7 @@ kinit -kt your_kerberos.keytab [email protected]
./${HADOOP_HOME}/bin/hadoop dfs -ls
gvfs://fileset/test_catalog/test_schema/test_fileset_1
```
-### Using the GVFS via Java code
+#### Via Java code
You can also perform operations on the files or directories managed by fileset
through Java code.
Make sure that your code is using the correct Hadoop environment, and that
your environment
@@ -150,7 +160,7 @@ FileSystem fs = filesetPath.getFileSystem(conf);
fs.getFileStatus(filesetPath);
```
-### Using GVFS with Apache Spark
+#### Via Apache Spark
1. Add the GVFS runtime jar to the Spark environment.
@@ -190,7 +200,7 @@ fs.getFileStatus(filesetPath);
```
-### Using GVFS with Tensorflow
+#### Via Tensorflow
For Tensorflow to support GVFS, you need to recompile the
[tensorflow-io](https://github.com/tensorflow/io) module.
@@ -229,15 +239,15 @@ For Tensorflow to support GVFS, you need to recompile the
[tensorflow-io](https:
print(tf.io.gfile.listdir('gvfs://fileset/test_catalog/test_schema/test_fileset_1/'))
```
-## Authentication
+### Authentication
Currently, Gravitino Virtual File System supports two kinds of authentication
types to access Gravitino server: `simple` and `oauth2`.
The type of `simple` is the default authentication type in Gravitino Virtual
File System.
-### How to use authentication
+#### How to use authentication
-#### Using `simple` authentication
+##### Using `simple` authentication
First, make sure that your Gravitino server is also configured to use the
`simple` authentication mode.
@@ -261,7 +271,7 @@ Path filesetPath = new
Path("gvfs://fileset/test_catalog/test_schema/test_filese
FileSystem fs = filesetPath.getFileSystem(conf);
```
-#### Using `OAuth` authentication
+##### Using `OAuth` authentication
If you want to use `oauth2` authentication for the Gravitino client in the
Gravitino Virtual File System,
please refer to this document to complete the configuration of the Gravitino
server and the OAuth server: [Security](./security.md).
@@ -285,7 +295,7 @@ Path filesetPath = new
Path("gvfs://fileset/test_catalog/test_schema/test_filese
FileSystem fs = filesetPath.getFileSystem(conf);
```
-#### Using `Kerberos` authentication
+##### Using `Kerberos` authentication
If you want to use `kerberos` authentication for the Gravitino client in the
Gravitino Virtual File System,
please refer to this document to complete the configuration of the Gravitino
server: [Security](./security.md).
@@ -307,3 +317,216 @@ conf.set("fs.gravitino.client.kerberos.keytabFilePath",
"${your_kerberos_keytab}
Path filesetPath = new
Path("gvfs://fileset/test_catalog/test_schema/test_fileset_1");
FileSystem fs = filesetPath.getFileSystem(conf);
```
+
+## 2. Managing files of Fileset with Python GVFS
+
+### Prerequisites
+
++ A Hadoop environment with HDFS running. Currently, only Fileset on HDFS is supported.
+ GVFS in Python has been tested against Hadoop 2.7.3. It is recommended to use Hadoop 2.7.3 or later,
+ and it should work with Hadoop 3.x. Please create an [issue](https://www.github.com/apache/gravitino/issues)
+ if you find any compatibility issues.
++ Python version >= 3.8. GVFS in Python has been tested with Python 3.8 and Python 3.9.
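Since the minimum interpreter version matters, a quick sanity check at the top of your script can fail fast; a trivial sketch:

```python
import sys

# Fail fast if the interpreter is older than the minimum GVFS supports (3.8).
if sys.version_info < (3, 8):
    raise RuntimeError("GVFS in Python requires Python >= 3.8, found %d.%d"
                       % sys.version_info[:2])
print("Python version OK: %d.%d" % sys.version_info[:2])
```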
+
+Attention: If you are using the macOS or Windows operating system, you need to follow the steps in the
+[Hadoop official building documentation](https://github.com/apache/hadoop/blob/trunk/BUILDING.txt) (for your Hadoop version)
+to recompile the native libraries like `libhdfs`, and completely replace the files in `${HADOOP_HOME}/lib/native`.
+
+### Configuration
+
+| Configuration item   | Description                                                                                                               | Default value | Required | Since version |
+|----------------------|---------------------------------------------------------------------------------------------------------------------------|---------------|----------|---------------|
+| `server_uri`         | The Gravitino server uri, e.g. `http://localhost:8090`.                                                                   | (none)        | Yes      | 0.6.0         |
+| `metalake_name`      | The metalake name which the fileset belongs to.                                                                           | (none)        | Yes      | 0.6.0         |
+| `cache_size`         | The cache capacity of the Gravitino Virtual File System.                                                                  | `20`          | No       | 0.6.0         |
+| `cache_expired_time` | The value of time that the cache expires after accessing in the Gravitino Virtual File System. The value is in `seconds`. | `3600`        | No       | 0.6.0         |
+
+
+You can configure these properties when obtaining the `Gravitino Virtual
FileSystem` in Python like this:
+
+```python
+from gravitino import gvfs
+
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake")
+```
+
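To illustrate what `cache_size` and `cache_expired_time` control: the client conceptually keeps a bounded cache of resolved filesets whose entries expire a configured number of seconds after they were stored. The sketch below is a minimal stdlib illustration of that idea (hypothetical, not Gravitino's actual implementation):

```python
import time
from collections import OrderedDict

class TTLCache:
    """Size-bounded cache whose entries expire after ttl_seconds.

    Illustrative stand-in for the client-side fileset cache configured by
    cache_size and cache_expired_time.
    """

    def __init__(self, capacity=20, ttl_seconds=3600):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._entries = OrderedDict()  # key -> (value, stored_at)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        value, stored_at = item
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[key]  # entry expired
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (value, time.monotonic())
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the oldest entry

cache = TTLCache(capacity=2, ttl_seconds=3600)
cache.put("catalog.schema.fileset_1", "resolved storage location")
print(cache.get("catalog.schema.fileset_1"))
```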
+### Usage examples
+
+1. Make sure to obtain the Gravitino library.
+ You can get it by [pip](https://pip.pypa.io/en/stable/installation/):
+
+ ```shell
+ pip install gravitino
+ ```
+
+2. Configure the Hadoop environment.
+   Ensure that the Python client has Kerberos authentication information available, and
+   configure the Hadoop environment variables in the system environment:
+
+ ```shell
+ # kinit kerberos
+ kinit -kt /tmp/xxx.keytab [email protected]
+ # Or you can configure kerberos information in the Hadoop `core-site.xml` file
+ <property>
+ <name>hadoop.security.authentication</name>
+ <value>kerberos</value>
+ </property>
+
+ <property>
+ <name>hadoop.client.kerberos.principal</name>
+ <value>[email protected]</value>
+ </property>
+
+ <property>
+ <name>hadoop.client.keytab.file</name>
+ <value>/tmp/xxx.keytab</value>
+ </property>
+ # Configure Hadoop env in Linux
+ export HADOOP_HOME=${YOUR_HADOOP_PATH}
+ export HADOOP_CONF_DIR=${YOUR_HADOOP_PATH}/etc/hadoop
+ export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
+ ```
+
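Before initializing the filesystem, it can be worth verifying that the Hadoop variables exported above are actually visible to the Python process. A small illustrative check (the helper name is hypothetical):

```python
import os

def check_hadoop_env():
    """Return the Hadoop-related environment variables that are missing."""
    required = ["HADOOP_HOME", "HADOOP_CONF_DIR", "CLASSPATH"]
    return [name for name in required if not os.environ.get(name)]

missing = check_hadoop_env()
if missing:
    print("Missing Hadoop environment variables:", ", ".join(missing))
else:
    print("Hadoop environment looks configured.")
```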
+#### Via fsspec-style interface
+
+You can use the fsspec-style interface to perform operations on the fileset
files.
+
+For example:
+
+```python
+from gravitino import gvfs
+
+# init the gvfs
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake")
+
+# list file infos under the fileset
+fs.ls(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir")
+
+# get file info under the fileset
+fs.info(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.parquet")
+
+# check whether a file or a directory exists
+fs.exists(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir")
+
+# write something into a file
+with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt", mode="wb") as output_stream:
+    output_stream.write(b"hello world")
+
+# append something into a file
+with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt", mode="ab") as append_stream:
+    append_stream.write(b"hello world")
+
+# read something from a file
+with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt", mode="rb") as input_stream:
+    input_stream.read()
+
+# copy a file
+fs.cp_file(path1="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt",
+           path2="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test-1.txt")
+
+# delete a file
+fs.rm_file(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/ttt/test-1.txt")
+
+# two methods to create a directory
+fs.makedirs(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_2")
+
+fs.mkdir(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_3")
+
+# delete a file or a directory recursively
+fs.rm(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_2", recursive=True)
+
+# delete a directory
+fs.rmdir(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_2")
+
+# move a file or a directory
+fs.mv(path1="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test-1.txt",
+      path2="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test-2.txt")
+
+# get the content of a file
+fs.cat_file(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test-1.txt")
+
+# copy a remote file to local
+fs.get_file(rpath="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test-1.txt",
+ lpath="/tmp/local-file-1.txt")
+```
+
+#### Integrating with Third-party Python libraries
+
+You can also perform operations on the files or directories managed by fileset
+by integrating with third-party Python libraries that support fsspec-compatible filesystems.
+
+For example:
+1. Integrating with [Pandas](https://pandas.pydata.org/docs/reference/io.html) (2.0.3).
+
+```python
+from gravitino import gvfs
+import pandas as pd
+
+data = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'ID': [20, 21, 19, 18]})
+storage_options = {'server_uri': 'http://localhost:8090', 'metalake_name': 'test_metalake'}
+# save data to a parquet file under the fileset
+data.to_parquet('gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet', storage_options=storage_options)
+
+# read data from a parquet file under the fileset
+ds = pd.read_parquet(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet",
+                     storage_options=storage_options)
+print(ds)
+
+# save data to a csv file under the fileset
+data.to_csv('gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.csv', storage_options=storage_options)
+
+# read data from a csv file under the fileset
+df = pd.read_csv('gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.csv', storage_options=storage_options)
+print(df)
+```
+
+2. Integrating with [PyArrow](https://arrow.apache.org/docs/python/filesystems.html) (15.0.2).
+
+```python
+from gravitino import gvfs
+import pyarrow.dataset as dt
+import pyarrow.parquet as pq
+
+fs = gvfs.GravitinoVirtualFileSystem(
+ server_uri="http://localhost:8090", metalake_name="test_metalake"
+)
+
+# read a parquet file as arrow dataset
+arrow_dataset = dt.dataset("gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet", filesystem=fs)
+
+# read a parquet file as arrow parquet table
+arrow_table = pq.read_table("gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet", filesystem=fs)
+```
+
+3. Integrating with [Ray](https://docs.ray.io/en/latest/data/loading-data.html#loading-data) (2.10.0).
+
+```python
+from gravitino import gvfs
+import ray
+
+fs = gvfs.GravitinoVirtualFileSystem(
+ server_uri="http://localhost:8090", metalake_name="test_metalake"
+)
+
+# read a parquet file as ray dataset
+ds = ray.data.read_parquet("gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet", filesystem=fs)
+```
+
+4. Integrating with [LlamaIndex](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/#support-for-external-filesystems) (0.10.40).
+
+```python
+from gravitino import gvfs
+from llama_index.core import SimpleDirectoryReader
+
+fs = gvfs.GravitinoVirtualFileSystem(server_uri=server_uri, metalake_name=metalake_name)
+
+# read all document files like csv files under the fileset sub dir
+reader = SimpleDirectoryReader(
+ input_dir='fileset/fileset_catalog/tmp/tmp_fileset/sub_dir',
+ fs=fs,
+ recursive=True, # recursively searches all subdirectories
+)
+documents = reader.load_data()
+print(documents)
+```