[
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022923#comment-17022923
]
Fabian Höring edited comment on ARROW-7584 at 1/24/20 1:04 PM:
---------------------------------------------------------------
You mean by wrapping everything on top of PyArrow?
I would expect the complexity of this class to be something like 1000 lines of
code, basically something like this:
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py
It seems strange to me to do this again and again in every project, and I don't
really understand the advantage. Maybe it lets you change the FileSystem
internals in PyArrow more easily, but once it is done I expect it to be stable
for years. I don't think this will change very often, as it must always stay
compatible and the underlying libs are stable and don't change.
How often did this doc change:
https://arrow.apache.org/docs/python/filesystems.html ? Seemingly never for
years. So this would change once now and then stay like that.
For some existing wrappers, see the one you sent and my own:
https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/hdfs.py
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/filesystem.py
> [Python] Improve ergonomics of new FileSystem API
> -------------------------------------------------
>
> Key: ARROW-7584
> URL: https://issues.apache.org/jira/browse/ARROW-7584
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Fabian Höring
> Priority: Major
> Labels: FileSystem
>
> The [new Python FileSystem API
> |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is
> nice but seems to be very verbose to use.
> The documentation of the old FS API is
> [here|https://arrow.apache.org/docs/python/filesystems.html]
> h2. Here are some examples
> *Filesystem access:*
> Before:
> {code}
> fs.ls()
> fs.mkdir()
> fs.rmdir()
> {code}
> Now:
> {code}
> fs.get_target_stats()
> fs.create_dir()
> fs.delete_dir()
> {code}
> What is the advantage of having longer method names? The short ones seem clear
> and are much easier to use. Seems like an easy change. It is also consistent
> with what hdfs does in the
> [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works
> naturally with a local filesystem.
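> For illustration, a rough and untested sketch of the difference when listing a
> directory (the class and method names are assumed from the current pyarrow.fs
> module, the path is a placeholder):
> {code}
> from pyarrow.fs import LocalFileSystem, FileSelector
>
> fs = LocalFileSystem()
>
> # old API style: one short call
> # paths = fs.ls("/tmp/data")
>
> # new API style: build a selector, then pull the paths out of the stats
> selector = FileSelector("/tmp/data", recursive=False)
> paths = [stat.path for stat in fs.get_target_stats(selector)]
> {code}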
> *File opening:*
> Before:
> {code}
> with fs.open(self, path, mode=u'rb', buffer_size=None)
> {code}
> Now:
> {code}
> fs.open_input_file()
> fs.open_input_stream()
> fs.open_output_stream()
> {code}
> It seems more natural to fit Python's standard open function, which works for
> local file access as well. Not sure if this is possible to do easily, as there
> is a `_wrap_output_stream` method.
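> As an illustration of the kind of convenience layer I have in mind, an untested
> sketch of a helper that maps the familiar modes onto the new methods (the helper
> name and the dispatch are my own assumption, not an existing pyarrow API):
> {code}
> def open_file(fs, path, mode="rb"):
>     # map the familiar open() modes onto the new FileSystem methods
>     if mode in ("rb", "r"):
>         return fs.open_input_stream(path)
>     if mode in ("wb", "w"):
>         return fs.open_output_stream(path)
>     if mode in ("ab", "a"):
>         return fs.open_append_stream(path)
>     raise ValueError("unsupported mode: %s" % mode)
> {code}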
> h2. Possible solutions
> - If the current Python API is still unused, we could just rename the methods
> - We could keep everything as is and add some alias methods (a rough sketch of
> such aliases is shown below, after this list); it would make the FileSystem
> class a bit messy I think because there would always be 2 methods to do the
> same work
> - Make everything compatible with fsspec and reference the spec, see
> https://issues.apache.org/jira/browse/ARROW-7102.
> I like the idea of the https://github.com/intake/filesystem_spec repo. Some
> comments on the solutions proposed there:
> Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean
> wrapping, in yet another repo, a FileSystem that is not good enough on its own
> Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the
> documented "official" pyarrow FileSystem it is fine I think, otherwise it would
> be yet another wrapper on top of the pyarrow "official" fs
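> For the alias option above, an untested sketch of what a thin wrapper could
> look like (the class name is hypothetical; the underlying method names are
> assumed from the current pyarrow.fs module):
> {code}
> from pyarrow.fs import FileSelector
>
> class ShortNameFileSystem:
>     # hypothetical wrapper adding short aliases on top of a pyarrow.fs FileSystem
>     def __init__(self, fs):
>         self._fs = fs
>
>     def mkdir(self, path):
>         return self._fs.create_dir(path)
>
>     def rmdir(self, path):
>         return self._fs.delete_dir(path)
>
>     def ls(self, path):
>         stats = self._fs.get_target_stats(FileSelector(path))
>         return [stat.path for stat in stats]
> {code}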
> h2. Tensorflow RFC on FileSystems
> Tensorflow is also doing some standardization work on their FileSystem:
> https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
> Not clear (to me) what they will do with the Python file API though. It seems
> like they will also just wrap the C code back to
> [tf.GFile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]
> h2. Other considerations on FS ergonomics
> In the long run I would also like to enhance the FileSystem API and add more
> methods that build on the basic ones to provide new features, for example:
> - introduce put and get on top of the streams that directly upload/download
> files
> - introduce
> [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from
> dask/hdfs3
> - introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
> from dask/hdfs3
> - check if selector works with globs or add
> https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
> - be able to write strings to the file streams (instead of only bytes,
> already implemented by
> https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96); it would
> permit directly using some Python APIs like json.dump (a possible adapter is
> sketched after this list)
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     json.dump(res, fd)
> {code}
> instead of
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res))
> {code}
> or like currently (with the old API, which requires encoding each time;
> untested with the new one)
> {code}
> with fs.open(path, "wb") as fd:
>     res = {"a": "bc"}
>     fd.write(json.dumps(res).encode())
> {code}
> - implementing
> [readline|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L809],
> needed for:
> {code}
> with hdfs.open("file", 'wb') as outfile:
>     pickle.dump({"a": "b"}, outfile)
> with hdfs.open("file", 'rb') as infile:
>     pickle.load(infile)
> {code}
> - not clear how to make this also work when reading from files
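> For the string-writing point above, an untested sketch of a minimal adapter
> that encodes strings before writing, so that APIs like json.dump can target a
> binary output stream (the adapter name is hypothetical, not an existing
> pyarrow API; readline support for reading back would need a similar adapter on
> the input side):
> {code}
> import json
>
> class TextWriter:
>     # hypothetical adapter: encode str to bytes before writing to a binary stream
>     def __init__(self, binary_stream, encoding="utf-8"):
>         self._stream = binary_stream
>         self._encoding = encoding
>
>     def write(self, data):
>         if isinstance(data, str):
>             data = data.encode(self._encoding)
>         return self._stream.write(data)
>
>     def __enter__(self):
>         return self
>
>     def __exit__(self, *exc):
>         self._stream.close()
>
> # assuming fs is a pyarrow.fs FileSystem and path is the target file path
> with TextWriter(fs.open_output_stream(path)) as fd:
>     json.dump({"a": "bc"}, fd)
> {code}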