xloya commented on code in PR #3931:
URL: https://github.com/apache/gravitino/pull/3931#discussion_r1670177143


##########
docs/how-to-use-gvfs.md:
##########
@@ -307,3 +315,210 @@ conf.set("fs.gravitino.client.kerberos.keytabFilePath", "${your_kerberos_keytab}
 Path filesetPath = new Path("gvfs://fileset/test_catalog/test_schema/test_fileset_1");
 FileSystem fs = filesetPath.getFileSystem(conf);
 ```
+
+## 2. Managing files of Fileset with Python GVFS
+
+### Prerequisites
+
++ A Hadoop environment with HDFS running. Currently, GVFS in Python only supports Fileset on HDFS.
+  GVFS in Python has been tested against Hadoop 2.7.3. It is recommended to use Hadoop 2.7.3 or later,
+  and it should also work with Hadoop 3.x. Please create an [issue](https://www.github.com/datastrato/gravitino/issues)
+  if you find any compatibility issues.
++ Python version 3.8 or later. GVFS has been tested to work well with Python 3.8 and Python 3.9
+  (a quick version check is sketched after this list).
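+
+A minimal interpreter check, using only the standard library, to fail fast on unsupported Python versions:
+
+```python
+import sys
+
+# GVFS in Python requires Python 3.8 or later
+assert sys.version_info >= (3, 8), f"unsupported Python version: {sys.version}"
+```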
+
+Attention: If you are using the macOS or Windows operating system, you need to follow the steps in the
+[Hadoop official building documentation](https://github.com/apache/hadoop/blob/trunk/BUILDING.txt) (it needs to match your Hadoop version)
+to recompile the native libraries such as `libhdfs`, and completely replace the files in `${HADOOP_HOME}/lib/native`.
+
+### Configuration
+
+| Configuration item   | Description                                                                                                                | Default value | Required | Since version |
+|----------------------|----------------------------------------------------------------------------------------------------------------------------|---------------|----------|---------------|
+| `server_uri`         | The Gravitino server URI, e.g. `http://localhost:8090`.                                                                     | (none)        | Yes      | 0.6.0         |
+| `metalake_name`      | The metalake name which the fileset belongs to.                                                                             | (none)        | Yes      | 0.6.0         |
+| `cache_size`         | The cache capacity of the Gravitino Virtual File System.                                                                    | `20`          | No       | 0.6.0         |
+| `cache_expired_time` | The time after which a cache entry expires once accessed in the Gravitino Virtual File System. The value is in `seconds`.   | `3600`        | No       | 0.6.0         |
+
+
+You can configure these properties when obtaining the `Gravitino Virtual FileSystem` in Python like this:
+
+```python
+from gravitino import gvfs
+
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake")
+```
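+
+The cache options from the table can presumably be tuned the same way; a minimal sketch, assuming the constructor accepts `cache_size` and `cache_expired_time` as keyword arguments alongside `server_uri` and `metalake_name`:
+
+```python
+from gravitino import gvfs
+
+# assumed keyword arguments matching the configuration table above
+fs = gvfs.GravitinoVirtualFileSystem(
+    server_uri="http://localhost:8090",
+    metalake_name="test_metalake",
+    cache_size=10,           # default: 20
+    cache_expired_time=300,  # in seconds, default: 3600
+)
+```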
+
+### Usage examples
+
+1. Make sure to obtain the Gravitino library.
+   You can install it via [pip](https://pip.pypa.io/en/stable/installation/):
+
+```shell
+pip install gravitino
+```
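+
+A quick way to confirm the installation is to import the module; a minimal check using only the `gvfs` module shown in this document:
+
+```python
+# if the import succeeds, the gravitino package is installed correctly
+from gravitino import gvfs
+
+print(gvfs.GravitinoVirtualFileSystem)
+```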
+
+2. Configure the Hadoop environment.
+   You should ensure that the Python client has the Kerberos authentication information and
+   configure the Hadoop environment variables in the system environment:
+```shell
+# kinit kerberos
+kinit -kt /tmp/xxx.keytab [email protected]
+```
+
+Or you can configure the Kerberos information in the Hadoop `core-site.xml` file:
+
+```xml
+<property>
+  <name>hadoop.security.authentication</name>
+  <value>kerberos</value>
+</property>
+
+<property>
+  <name>hadoop.client.kerberos.principal</name>
+  <value>[email protected]</value>
+</property>
+
+<property>
+  <name>hadoop.client.keytab.file</name>
+  <value>/tmp/xxx.keytab</value>
+</property>
+```
+
+Then configure the Hadoop environment in Linux:
+
+```shell
+export HADOOP_HOME=${YOUR_HADOOP_PATH}
+export HADOOP_CONF_DIR=${YOUR_HADOOP_PATH}/etc/hadoop
+export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
+```
+
+#### Via fsspec-style interface
+
+You can use the fsspec-style interface to perform operations on the fileset files.
+
+For example:
+
+```python
+from gravitino import gvfs
+
+# init the gvfs
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake")
+
+# list file infos under the fileset
+fs.ls(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir")
+
+# get file info under the fileset
+fs.info(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.parquet")
+
+# check whether a file or a directory exists
+fs.exists(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir")
+
+# write something into a file
+with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt", mode="wb") as output_stream:
+    output_stream.write(b"hello world")
+
+# append something into a file
+with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt", mode="ab") as append_stream:
+    append_stream.write(b"hello world")
+
+# read something from a file
+with fs.open(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt", mode="rb") as input_stream:
+    input_stream.read()
+
+# copy a file
+fs.cp_file(path1="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test.txt",
+           path2="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test-1.txt")
+
+# delete a file
+fs.rm_file(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test-1.txt")
+
+# two methods to create a directory
+fs.makedirs(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_2")
+
+fs.mkdir(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_3")
+
+# delete a file or a directory recursively
+fs.rm(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_2", recursive=True)
+
+# delete a directory
+fs.rmdir(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir_3")
+
+# move a file or a directory
+fs.mv(path1="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test-1.txt",
+      path2="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/test-2.txt")
+
+# get the content of a file
+fs.cat_file(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test-1.txt")
+
+# copy a remote file to local
+fs.get_file(rpath="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test-1.txt",
+            lpath="/tmp/local-file-1.txt")
+```
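+
+Putting a few of these calls together, a simple round trip looks like this; a minimal sketch reusing the hypothetical `fileset_catalog/tmp/tmp_fileset` paths from the examples above:
+
+```python
+from gravitino import gvfs
+
+fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalake_name="test_metalake")
+
+path = "gvfs://fileset/fileset_catalog/tmp/tmp_fileset/sub_dir/round_trip.txt"
+
+# write a file, read it back, then clean it up
+with fs.open(path=path, mode="wb") as f:
+    f.write(b"hello gvfs")
+with fs.open(path=path, mode="rb") as f:
+    assert f.read() == b"hello gvfs"
+fs.rm_file(path=path)
+```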
+
+#### Integrating with Third-party Python libraries
+
+You can also perform operations on the files or directories managed by a fileset
+by integrating with third-party Python libraries that support fsspec-compatible filesystems.
+
+For example:
+1. Integrating with [Pandas](https://pandas.pydata.org/docs/reference/io.html) (2.0.3).
+```python
+from gravitino import gvfs
+import pandas as pd
+
+data = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'ID': [20, 21, 19, 18]})
+storage_options = {'server_uri': 'http://localhost:8090', 'metalake_name': 'test_metalake'}
+# save data to a parquet file under the fileset
+data.to_parquet('gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet', storage_options=storage_options)
+
+# read data from a parquet file under the fileset
+ds = pd.read_parquet(path="gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet",
+                     storage_options=storage_options)
+print(ds)
+
+# save data to a csv file under the fileset
+data.to_csv('gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.csv', storage_options=storage_options)
+
+# read data from a csv file under the fileset
+df = pd.read_csv('gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.csv', storage_options=storage_options)
+print(df)
+```
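+
+Because these are ordinary pandas I/O calls, the usual pandas options apply on top of `storage_options`; for example, reading back only selected columns from the hypothetical parquet file written above:
+
+```python
+import pandas as pd
+
+storage_options = {'server_uri': 'http://localhost:8090', 'metalake_name': 'test_metalake'}
+# read only the 'Name' column from the parquet file under the fileset
+names = pd.read_parquet("gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet",
+                        columns=['Name'], storage_options=storage_options)
+print(names)
+```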
+
+2. Integrating with [PyArrow](https://arrow.apache.org/docs/python/filesystems.html) (15.0.2).
+```python
+from gravitino import gvfs
+import pyarrow.dataset as dt
+
+fs = gvfs.GravitinoVirtualFileSystem(
+    server_uri="http://localhost:8090", metalake_name="test_metalake"
+)
+
+# read a parquet file as an arrow dataset
+arrow_dataset = dt.dataset("gvfs://fileset/fileset_catalog/tmp/tmp_fileset/test.parquet", filesystem=fs)

Review Comment:
   There is a problem with the code here; please update it to use the PyArrow lib.


