[ https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fabian Höring updated ARROW-7584:
---------------------------------
    Description: 
The [new Python FileSystem API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems very verbose to use.

The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of the longer method names? The short ones seem clear and are much easier to use. This seems like an easy change. It would also be consistent with what hdfs does in the [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to match Python's standard open function, which works for local file access as well. Not sure whether this is easy to do, as there is a `_wrap_output_stream` method.

h3. Proposed solutions

- If the current Python API is still unused, we could just rename the methods
- We could keep everything as is and add some alias methods; I think this would make the FileSystem class a bit messy, since there would always be 2 methods that do the same work

h3. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with the Python file API though. It seems like they will also just wrap the C code back to [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h3. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more methods that build on the basic ones to provide new features, for example:
- introduce put and get on top of the streams, to directly upload/download files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check whether the selector works with globs, or add the equivalent of https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes); this would make it possible to use some Python APIs like json.dump directly
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    json.dump(res, fd)
{code}
instead of
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res))
{code}
or like currently (with the old API, which required an encode each time; untested with the new one)
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res).encode())
{code}
- not clear how to make this also work when reading from files

> [Python] Improve ergonomics of new FileSystem API
> -------------------------------------------------
>
> Key: ARROW-7584
> URL: https://issues.apache.org/jira/browse/ARROW-7584
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Fabian Höring
> Priority: Major

--
This message was sent by Atlassian Jira (v8.3.4#803005)
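
The alias-method option from "Proposed solutions" could look roughly like the sketch below. It is only an illustration: ShortNameFileSystem and RecordingFS are hypothetical names, the stand-in backend exists only so the snippet runs without pyarrow, and the arguments are passed through untouched because the real signatures of get_target_stats, create_dir and delete_dir are not assumed here.

```python
class ShortNameFileSystem:
    """Hypothetical wrapper exposing the short names on top of the long ones."""

    def __init__(self, fs):
        self._fs = fs

    def ls(self, *args, **kwargs):
        # Alias for the longer get_target_stats() named in the description.
        return self._fs.get_target_stats(*args, **kwargs)

    def mkdir(self, *args, **kwargs):
        # Alias for create_dir().
        return self._fs.create_dir(*args, **kwargs)

    def rmdir(self, *args, **kwargs):
        # Alias for delete_dir().
        return self._fs.delete_dir(*args, **kwargs)

    def __getattr__(self, name):
        # Only called when normal lookup fails: forward everything else
        # (open_input_stream, ...) to the wrapped filesystem object.
        return getattr(self._fs, name)


class RecordingFS:
    """Stand-in backend (not a pyarrow class) that records which method ran."""

    def __init__(self):
        self.calls = []

    def get_target_stats(self, target):
        self.calls.append(("get_target_stats", target))
        return []

    def create_dir(self, path):
        self.calls.append(("create_dir", path))

    def delete_dir(self, path):
        self.calls.append(("delete_dir", path))


backend = RecordingFS()
fs = ShortNameFileSystem(backend)
fs.mkdir("/tmp/example")
fs.ls("/tmp/example")
fs.rmdir("/tmp/example")
```

A wrapper like this keeps the FileSystem class itself free of duplicate methods, at the cost of a second object for users to know about.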
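
For the json.dump case in the description, one possible approach is to wrap the binary stream in the standard library's io.TextIOWrapper, which also covers the reading direction. This is a sketch under assumptions: dump_json and load_json are hypothetical helper names, io.BytesIO stands in for a stream returned by the filesystem, and whether the streams of the new API can be wrapped this way is untested.

```python
import io
import json


def dump_json(binary_stream, obj):
    """Write obj as UTF-8 JSON text to a writable binary file-like object."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8")
    json.dump(obj, text)
    text.flush()
    # Detach so that discarding the wrapper does not close the caller's stream.
    text.detach()


def load_json(binary_stream):
    """Read UTF-8 JSON text from a readable binary file-like object."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8")
    try:
        return json.load(text)
    finally:
        text.detach()


# io.BytesIO is a stand-in for a binary stream opened through the filesystem.
out = io.BytesIO()
dump_json(out, {"a": "bc"})

restored = load_json(io.BytesIO(out.getvalue()))
```

With helpers like these the caller never has to call encode() or decode() by hand; the encoding is fixed in one place.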