[
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fabian Höring updated ARROW-7584:
---------------------------------
Description:
The [new Python FileSystem API
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is
nice but seems to be very verbose to use.
The documentation of the old FS API is
[here|https://arrow.apache.org/docs/python/filesystems.html]
h3. Here are some examples
*Filesystem access:*
Before:
fs.ls()
fs.mkdir()
fs.rmdir()
Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()
What is the advantage of the longer method names? The short ones seem clear
and are much easier to use. This seems like an easy change. The short names are
also consistent with what hdfs does in the
[fs api|https://arrow.apache.org/docs/python/filesystems.html] and work
naturally with a local filesystem.
*File opening:*
Before:
fs.open(path, mode=u'rb', buffer_size=None)
Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()
It seems more natural to follow Python's standard open function, which works
for local file access as well. I am not sure whether this is easy to do, as
there is a `_wrap_output_stream` method.
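One way to approximate the built-in open on top of the three new methods is a small dispatch helper. The sketch below is only an illustration: `open_file` and its `seekable` parameter are hypothetical names, not part of pyarrow, and it assumes the filesystem object exposes `open_input_file`, `open_input_stream` and `open_output_stream` taking a path, as listed above. `_FakeFS` is an in-memory stand-in so that the snippet runs without a real filesystem.

```python
import io


def open_file(fs, path, mode="rb", seekable=True):
    """Hypothetical helper: map a Python-style mode onto the new methods."""
    if mode == "rb":
        # Random-access file when seeking is needed, plain stream otherwise.
        return fs.open_input_file(path) if seekable else fs.open_input_stream(path)
    if mode == "wb":
        return fs.open_output_stream(path)
    raise ValueError(f"unsupported mode: {mode!r}")


class _FakeFS:
    """In-memory stand-in (not pyarrow) that records which method was called."""

    def __init__(self):
        self.calls = []

    def open_input_file(self, path):
        self.calls.append(("open_input_file", path))
        return io.BytesIO(b"data")

    def open_input_stream(self, path):
        self.calls.append(("open_input_stream", path))
        return io.BytesIO(b"data")

    def open_output_stream(self, path):
        self.calls.append(("open_output_stream", path))
        return io.BytesIO()


fs = _FakeFS()
with open_file(fs, "/tmp/example.bin", "rb") as f:
    payload = f.read()
open_file(fs, "/tmp/example.bin", "wb").close()
```

Only binary modes are handled here; text modes would need an extra decoding layer on top of the returned stream.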
h3. Solutions
- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods; I think it would
make the FileSystem class a bit messy if there are always two methods that do
the same work
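The alias option could look roughly like the sketch below. It is an illustration rather than the pyarrow implementation: `ShortNames` is a hypothetical wrapper, the alias mapping simply follows the before/now pairs listed above (argument conventions of the real methods may differ), and `_FakeFS` is a stand-in so that the snippet runs without pyarrow.

```python
class ShortNames:
    """Hypothetical wrapper exposing short aliases for the longer method names."""

    _ALIASES = {
        "ls": "get_target_stats",
        "mkdir": "create_dir",
        "rmdir": "delete_dir",
    }

    def __init__(self, fs):
        self._fs = fs

    def __getattr__(self, name):
        # Only called for attributes not found normally: translate an alias,
        # then forward the lookup to the wrapped filesystem.
        return getattr(self._fs, self._ALIASES.get(name, name))


class _FakeFS:
    """In-memory stand-in (not pyarrow) that only tracks directory names."""

    def __init__(self):
        self.dirs = set()

    def create_dir(self, path):
        self.dirs.add(path)

    def delete_dir(self, path):
        self.dirs.discard(path)

    def get_target_stats(self, path):
        # A real filesystem would return stats objects; this fake returns names.
        return sorted(self.dirs)


fs = ShortNames(_FakeFS())
fs.mkdir("/data")
fs.mkdir("/tmp")
fs.rmdir("/tmp")
listing = fs.ls("/")
```

Keeping the aliases in a separate wrapper leaves the FileSystem class itself with a single name per operation, at the cost of one more object for users to know about.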
h3. Other considerations on ergonomics
In the long run I would also like to enhance the FileSystem API and add more
methods that build on the basic ones to provide new features, for example:
- introduce put and get on top of the streams that directly upload/download
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes); this
would allow using some Python APIs such as json.dump directly
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    json.dump(res, fd)
{code}
instead of
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res))
{code}
or, as is currently done (with the old API, which requires an encode each time;
untested with the new one)
{code}
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res).encode())
{code}
- not clear how to make this also work when reading from files
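For the last two points, one possible approach is the standard library's io.TextIOWrapper, which layers text encoding and decoding over a binary stream. The sketch below uses io.BytesIO as a stand-in for the filesystem streams; whether the streams returned by the new API are compatible with TextIOWrapper is an assumption that has not been verified here.

```python
import io
import json

res = {"a": "bc"}

# Writing: wrap the binary stream so json.dump can write str to it.
raw_out = io.BytesIO()  # stand-in for a binary output stream
text_out = io.TextIOWrapper(raw_out, encoding="utf-8")
json.dump(res, text_out)
text_out.flush()   # push the encoded bytes into the binary stream
text_out.detach()  # release the wrapper without closing raw_out
written = raw_out.getvalue()

# Reading: the same wrapper decodes bytes so json.load receives str.
raw_in = io.BytesIO(written)  # stand-in for a binary input stream
with io.TextIOWrapper(raw_in, encoding="utf-8") as text_in:
    loaded = json.load(text_in)
```

With this approach the caller chooses the encoding explicitly, and the same wrapper covers both the write and the read direction.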
> [Python] Improve ergonomics of new FileSystem API
> -------------------------------------------------
>
> Key: ARROW-7584
> URL: https://issues.apache.org/jira/browse/ARROW-7584
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Fabian Höring
> Priority: Major