Joris Van den Bossche created ARROW-9938:
--------------------------------------------
Summary: [Python] Add filesystem capabilities to other IO formats
(feather, csv, json, ..)?
Key: ARROW-9938
URL: https://issues.apache.org/jira/browse/ARROW-9938
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche
In the parquet IO functions, we support reading/writing files on non-local
filesystems directly (in addition to passing a buffer) by:
- passing a URI (e.g. {{pq.read_table("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (e.g.
{{pq.read_table("bucket/data.parquet", filesystem=S3FileSystem(...))}})
For other file formats such as feather, on the other hand, we only support
local file paths (or file objects). So for those, you need the more manual
approach (I _suppose_ this works?):
{code:python}
from pyarrow import fs, feather

# Open the remote file explicitly, then hand the file object to the reader
s3 = fs.S3FileSystem()
with s3.open_input_file("bucket/data.arrow") as file:
    table = feather.read_table(file)
{code}
So the question is: do we want to extend this filesystem support to the other
file formats (feather, csv, json) to make it more uniform across pyarrow, or do
we prefer to keep the plain readers low-level (so that people use the datasets
API when they want more convenience)?
cc [~apitrou] [~kszucs]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)