[ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027794#comment-17027794
 ] 

Fabian Höring edited comment on ARROW-7584 at 1/31/20 9:33 PM:
---------------------------------------------------------------

Yes, I will do it.
I took [~jorisvandenbossche]'s advice: take the parts I think are most useful.

To start, I would like to replace get_target_stats with fsspec-compatible info/ls
methods (see the sketch after this list):
- Rename it and wrap it into the new dataset reader
- Return a dictionary instead of FileStats (maybe optional, and as a next step)
- Remove the selector if possible (as a next step)
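
As a rough, untested sketch of what those wrappers could look like: the info/ls
names and the dict layout are proposal assumptions; only get_target_stats,
Selector and FileStats exist in the current API.
{code}
from pyarrow.fs import LocalFileSystem, Selector

def info(fs, path):
    # Return a plain dict instead of a FileStats object
    stats = fs.get_target_stats([path])[0]
    return {"name": stats.path, "size": stats.size, "type": stats.type}

def ls(fs, path):
    # List a directory without exposing Selector to the caller
    selector = Selector(path, recursive=False)
    return [s.path for s in fs.get_target_stats(selector)]

fs = LocalFileSystem()
print(info(fs, "/tmp"))
print(ls(fs, "/tmp"))
{code}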

If you want to have a look, it currently looks like this (don't worry about the
tests; they were all green at some point):
https://github.com/fhoering/arrow/commits/introduce_info_ls

I will check on Monday how I can keep this consistent with the consuming dataset
module (either by keeping the selector or by propagating it all the way up).

Even all of this could already be worth discussing (the defaults to keep; the
exposed objects Selector, FileStats, FileType, ...). I will send a mail to the
mailing list if needed; otherwise just the comments in each PR.



> [Python] Improve ergonomics of new FileSystem API
> -------------------------------------------------
>
>                 Key: ARROW-7584
>                 URL: https://issues.apache.org/jira/browse/ARROW-7584
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Fabian Höring
>            Assignee: Fabian Höring
>            Priority: Major
>              Labels: FileSystem
>
> The [new Python FileSystem
> API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is
> nice but seems very verbose to use.
> The documentation of the old FS API is
> [here|https://arrow.apache.org/docs/python/filesystems.html].
> h2. Here are some examples
> *Filesystem access:*
> Before:
> {code}
> fs.ls()
> fs.mkdir()
> fs.rmdir()
> {code}
> Now:
> {code}
> fs.get_target_stats()
> fs.create_dir()
> fs.delete_dir()
> {code}
> What is the advantage of the longer method names? The short ones seem clear
> and are much easier to use, so this looks like an easy change. It is also
> consistent with what hdfs does in the [fs
> api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally
> with a local filesystem.
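>
> As a rough, untested sketch of the verbosity difference when listing a
> directory (Selector and FileStats follow the names used in this issue):
> {code}
> from pyarrow.fs import LocalFileSystem, Selector
>
> fs = LocalFileSystem()
>
> # Old-style API: one short call
> # files = fs.ls("/tmp/data")
>
> # New API: build a Selector, call get_target_stats, unpack the FileStats
> stats = fs.get_target_stats(Selector("/tmp/data", recursive=False))
> files = [s.path for s in stats]
> {code}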
> *File opening:*
> Before:
> {code}
> with fs.open(path, mode='rb', buffer_size=None) as f:
>     ...
> {code}
> Now:
> {code}
> fs.open_input_file()
> fs.open_input_stream()
> fs.open_output_stream()
> {code}
> It seems more natural to match Python's standard open function, which works
> for local file access as well. I am not sure whether this is easy to do, as
> there is a `_wrap_output_stream` method. A rough sketch of the difference is
> below.
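>
> An untested sketch of the two styles side by side (the commented lines mirror
> the old open-style API for comparison; paths are placeholders):
> {code}
> from pyarrow.fs import LocalFileSystem
>
> fs = LocalFileSystem()
> path = "/tmp/example.bin"
>
> # New API: one method per direction
> with fs.open_output_stream(path) as f:
>     f.write(b"hello")
> with fs.open_input_stream(path) as f:
>     data = f.read()
>
> # Old-style API (for comparison; not available on the new FileSystem):
> # with fs.open(path, "rb") as f:
> #     data = f.read()
> {code}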
> h2. Possible solutions
> - If the current Python API is still unused, we could just rename the methods
> - We could keep everything as is and add some alias methods; I think that
> would make the FileSystem class a bit messy, because there would always be 2
> methods doing the same work
> - Make everything compatible with fsspec and reference the spec, see
> https://issues.apache.org/jira/browse/ARROW-7102.
>     I like the idea of the https://github.com/intake/filesystem_spec repo.
> Some comments on the solutions proposed there:
>     Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would
> mean wrapping, in yet another repo, a FileSystem that is not good enough on
> its own
>     Make a pyarrow.fs wrapper for fsspec => if that wrapper becomes the
> documented "official" pyarrow FileSystem, I think it is fine; otherwise it
> would be yet another wrapper on top of the "official" pyarrow fs
> h2. Tensorflow RFC on FileSystems
> Tensorflow is also doing some standardization work on their FileSystem:
> https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
> It is not clear (to me) what they will do with the Python file API, though.
> It seems like they will also just wrap the C code back into
> [tf.GFile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile].
> h2. Other considerations on FS ergonomics
> In the long run I would also like to enhance the FileSystem API with more
> methods that build on the basic ones to provide new features, for example (a
> sketch of some of these follows after this list):
> - introduce put and get on top of the streams that directly upload/download 
> files
> - introduce 
> [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from 
> dask/hdfs3
> - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] 
> from dask/hdfs3
> - check if selector works with globs or add 
> https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
> - be able to write strings to the file streams (instead of only bytes;
> already implemented by
> https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), which would
> permit directly using Python APIs like json.dump:
> {code}
> with fs.open(path, "wb") as fd:
>   res = {"a": "bc"}
>   json.dump(res, fd)
> {code}
> instead of
> {code}
> with fs.open(path, "wb") as fd:
>   res = {"a": "bc"}
>   fd.write(json.dumps(res))
> {code}
> or like currently (with the old API, which requires encoding each time;
> untested with the new one):
> {code}
> with fs.open(path, "wb") as fd:
>   res = {"a": "bc"}
>   fd.write(json.dumps(res).encode())
> {code}
> - implementing 
> [readline|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L809], 
> needed for:
> {code}
> with hdfs.open("file", 'wb') as outfile:
>   pickle.dump({"a": "b"}, outfile)
> with hdfs.open("file", 'rb') as infile:
>   pickle.load(infile)
> {code}
> - it is not clear how to make this also work when reading from files; see the
> sketch below
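>
> As an untested sketch, some of the above (touch, du, string I/O) could be
> built on the basic calls. Selector and get_target_stats follow the names used
> in this issue; the io.TextIOWrapper part assumes the returned streams
> duck-type enough of the standard io interface:
> {code}
> import io
> import json
>
> from pyarrow.fs import LocalFileSystem, Selector
>
> fs = LocalFileSystem()
>
> def touch(fs, path):
>     # Create an empty file by opening and closing an output stream
>     with fs.open_output_stream(path):
>         pass
>
> def du(fs, path):
>     # Total size of all files under path, via the existing stats call
>     stats = fs.get_target_stats(Selector(path, recursive=True))
>     return sum(s.size for s in stats if s.size is not None)
>
> # String I/O without a manual encode(): wrap the binary stream once;
> # the wrapper also provides readline() when reading
> with fs.open_output_stream("/tmp/res.json") as raw:
>     with io.TextIOWrapper(raw, encoding="utf-8") as fd:
>         json.dump({"a": "bc"}, fd)
> {code}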



