[
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022838#comment-17022838
]
Fabian Höring edited comment on ARROW-7584 at 1/24/20 12:56 PM:
----------------------------------------------------------------
Thanks for the tip about pickle. I will have a look.
Indeed, if the FileSystem API is only intended for internal use then it doesn't
make any sense to change it.
Unfortunately there is currently no project that has a clean generic FileSystem
API for S3 and HDFS. The only viable alternative is TensorFlow, and I think it
should not be TensorFlow's role to define a generic FileSystem API.
I don't intend to create a generic FileSystem project on my own. I would
prefer to improve PyArrow's FileSystem API to make it reusable by other projects.
Reusable could even mean "interactive mode": replacing the fs shell commands
(e.g. hdfs dfs -ls) with Python only. That's how we used dask/hdfs3
and it worked well. But unfortunately it is deprecated in favor of PyArrow and
doesn't support viewfs.
> [Python] Improve ergonomics of new FileSystem API
> -------------------------------------------------
>
> Key: ARROW-7584
> URL: https://issues.apache.org/jira/browse/ARROW-7584
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Fabian Höring
> Priority: Major
> Labels: FileSystem
>
> The [new Python FileSystem API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185]
> is nice but seems very verbose to use.
> The documentation of the old FS API is
> [here|https://arrow.apache.org/docs/python/filesystems.html]
> h2. Here are some examples
> *Filesystem access:*
> Before:
> fs.ls()
> fs.mkdir()
> fs.rmdir()
> Now:
> fs.get_target_stats()
> fs.create_dir()
> fs.delete_dir()
> What is the advantage of the longer method names? The short ones seem clear
> and are much easier to use. Seems like an easy change. Also this is
> consistent with what hdfs does in the
> [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally
> with a local filesystem (a rough comparison is sketched below).
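> For illustration, a rough side-by-side sketch of both APIs (constructor names
> and the accepted arguments are assumptions here and may differ between versions):
> {code}
> # old API (pyarrow.filesystem) -- short, shell-style method names
> from pyarrow.filesystem import LocalFileSystem
> fs = LocalFileSystem()
> fs.mkdir("/tmp/data")
> print(fs.ls("/tmp"))
>
> # new API (pyarrow.fs) -- longer, more explicit method names
> from pyarrow import fs as pafs
> new_fs = pafs.LocalFileSystem()
> new_fs.create_dir("/tmp/data")
> print(new_fs.get_target_stats(["/tmp/data"]))  # stats for explicit paths
> new_fs.delete_dir("/tmp/data")
> {code}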
> *File opening:*
> Before:
> with fs.open(path, mode='rb', buffer_size=None)
> Now:
> fs.open_input_file()
> fs.open_input_stream()
> fs.open_output_stream()
> It seems more natural to match Python's standard open() function, which works
> for local file access as well. Not sure if this is easy to do since there
> is a `_wrap_output_stream` method (a possible dispatch helper is sketched below).
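> As a sketch only (fs_open is a hypothetical helper, not existing pyarrow code),
> a thin function could map the familiar open(path, mode) signature onto the new
> stream methods:
> {code}
> def fs_open(fs, path, mode="rb"):
>     """Hypothetical helper dispatching on mode (sketch, not pyarrow API)."""
>     if mode in ("rb", "r"):
>         return fs.open_input_stream(path)   # sequential read
>     if mode in ("wb", "w"):
>         return fs.open_output_stream(path)  # create / overwrite
>     if mode in ("ab", "a"):
>         return fs.open_append_stream(path)  # append, where supported
>     raise ValueError("unsupported mode: %s" % mode)
> {code}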
> h2. Possible solutions
> - If the current Python API is still unused we could just rename the methods
> - We could keep everything as is and add some alias methods; it would make
> the FileSystem class a bit messy I think because there would always be 2
> methods doing the same work (a rough sketch of such aliases follows after this list)
> - Make everything compatible with fsspec and reference the spec, see
> https://issues.apache.org/jira/browse/ARROW-7102.
> I like the idea of the https://github.com/intake/filesystem_spec repo. Some
> comments on the solutions proposed there:
> Make an fsspec wrapper for pyarrow.fs => seems strange to me, it would mean
> wrapping, in yet another repo, a FileSystem that is not good enough
> Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the
> documented "official" pyarrow FileSystem it is fine I think, otherwise it would
> be yet another wrapper on top of the pyarrow "official" fs
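> To make the alias idea concrete, a minimal (hypothetical) delegating wrapper
> could look like this:
> {code}
> class FriendlyFileSystem:
>     """Sketch of short aliases on top of a new pyarrow.fs filesystem."""
>     def __init__(self, fs):
>         self._fs = fs
>
>     def ls(self, path_or_selector):
>         return self._fs.get_target_stats(path_or_selector)
>
>     def mkdir(self, path):
>         return self._fs.create_dir(path)
>
>     def rmdir(self, path):
>         return self._fs.delete_dir(path)
>
>     def __getattr__(self, name):
>         # fall back to the wrapped filesystem for everything else
>         return getattr(self._fs, name)
> {code}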
> h2. Tensorflow RFC on FileSystems
> Tensorflow is also doing some standardization work on their FileSystem:
> https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
> Not clear (to me) what they will do with the Python file API though. It seems
> like they will also just wrap the C code back into
> [tf.GFile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]
> h2. Other considerations on FS ergonomics
> In the long run I would also like to enhance the FileSystem API and add more
> methods built on top of the basic ones to provide new features, for example:
> - introduce put and get on top of the streams that directly upload/download
> files
> - introduce
> [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from
> dask/hdfs3
> - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
> from dask/hdfs3
> - check if selector works with globs or add
> https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
> - be able to write strings to the file streams (instead of only bytes,
> already implemented by
> https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), which would
> make it possible to directly use some Python APIs like json.dump (a possible
> approach is sketched after the examples below)
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     json.dump(res, fd)
> {code}
> instead of
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res))
> {code}
> or like currently (with the old API, which requires calling encode() each
> time; untested with the new one)
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res).encode())
> {code}
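> One untested way to get there without changing the stream classes could be to
> wrap the binary stream in io.TextIOWrapper (assuming the pyarrow streams expose
> enough of the io interface; open_text is a hypothetical helper):
> {code}
> import io
> import json
>
> def open_text(fs, path, encoding="utf-8"):
>     # hypothetical helper: text-mode view over a binary output stream
>     return io.TextIOWrapper(fs.open_output_stream(path), encoding=encoding)
>
> with open_text(fs, path) as fd:
>     json.dump({"a": "bc"}, fd)
> {code}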
> - implementing
> [readline|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L809],
> needed for:
> {code}
> with hdfs.open("file", 'wb') as outfile:
>     pickle.dump({"a": "b"}, outfile)
> with hdfs.open("file", 'rb') as infile:
>     pickle.load(infile)
> {code}
> - not clear how to make this also work when reading from files (a possible
> workaround is sketched below)
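> An untested sketch of how the reading side could work today: readline is only
> needed by the unpickler, so buffering the stream (or reading it fully) should
> give pickle what it expects, assuming the new streams can be wrapped:
> {code}
> import io
> import pickle
>
> # read everything into memory and unpickle
> with fs.open_input_stream(path) as stream:
>     obj = pickle.loads(stream.read())
>
> # or keep it streaming by adding readline via a buffered wrapper
> with fs.open_input_stream(path) as stream:
>     obj = pickle.load(io.BufferedReader(stream))
> {code}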