[
https://issues.apache.org/jira/browse/ARROW-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-9938:
-----------------------------------------
Description:
In the parquet IO functions, we support reading/writing files from non-local
filesystems directly (in addition to passing a buffer) by:
- passing a URI (eg {{pq.read_table("s3://bucket/data.parquet")}})
- specifying the filesystem keyword (eg
{{pq.read_table("bucket/data.parquet", filesystem=S3FileSystem(...))}})
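For reference, the filesystem-keyword path can be exercised without S3 credentials; a runnable sketch using {{LocalFileSystem}} and a temporary file in place of an S3 bucket:

{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Write a small table to a local parquet file, then read it back through
# the filesystem keyword (LocalFileSystem standing in for S3FileSystem).
table = pa.table({"a": [1, 2, 3]})
path = os.path.join(tempfile.mkdtemp(), "data.parquet")
pq.write_table(table, path)

result = pq.read_table(path, filesystem=fs.LocalFileSystem())
assert result.equals(table)
{code}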
On the other hand, for other file formats such as feather, we only support
local files or buffers. So for those, you need to do it more manually (I
_suppose_ this works?):
{code:python}
from pyarrow import fs, feather

s3 = fs.S3FileSystem()
# open the remote file and pass the file object to the feather reader
with s3.open_input_file("bucket/data.arrow") as file:
    table = feather.read_table(file)
{code}
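The same manual pattern works against any filesystem implementation; a runnable local sketch (again with {{LocalFileSystem}} and a temporary file standing in for the S3 bucket):

{code:python}
import os
import tempfile

import pyarrow as pa
from pyarrow import fs, feather

# Write a small feather file locally, then read it back through the
# generic filesystem interface, mirroring the S3 snippet above.
table = pa.table({"a": [1, 2, 3]})
path = os.path.join(tempfile.mkdtemp(), "data.arrow")
feather.write_feather(table, path)

local = fs.LocalFileSystem()
# open_input_file returns a readable file object the feather reader accepts
with local.open_input_file(path) as f:
    roundtrip = feather.read_table(f)
assert roundtrip.equals(table)
{code}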
So the question comes up: do we want to extend this filesystem support
to other file formats (feather, csv, json) and make this more uniform across
pyarrow, or do we prefer to keep the plain readers more low-level (and let
people use the datasets API for more convenience)?
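For comparison, the datasets API mentioned above already accepts a path (or URI, plus an optional {{filesystem}} keyword) for feather files; a runnable local sketch:

{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow import feather

table = pa.table({"a": [1, 2, 3]})
path = os.path.join(tempfile.mkdtemp(), "data.arrow")
feather.write_feather(table, path)

# format="feather" (an alias of the "ipc"/"arrow" format); a
# filesystem= keyword is also accepted here
result = ds.dataset(path, format="feather").to_table()
assert result.equals(table)
{code}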
cc [~apitrou] [~kszucs]
> [Python] Add filesystem capabilities to other IO formats (feather, csv, json,
> ..)?
> ----------------------------------------------------------------------------------
>
> Key: ARROW-9938
> URL: https://issues.apache.org/jira/browse/ARROW-9938
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: filesystem
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)