[ https://issues.apache.org/jira/browse/ARROW-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251170#comment-17251170 ]

Joris Van den Bossche commented on ARROW-10937:
-----------------------------------------------

Thanks [~Filimonov]

So your original error shown in the top post is indeed caused by passing a
file path starting with "s3://..." while also passing a filesystem object.
IMO we should improve the handling of that case (or at least improve the error
message).
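
For reference, a minimal sketch of the two combinations that I would expect to work (based on the path from your report; with the new dataset-based reader, s3fs filesystems are wrapped through the fsspec handler):

{code:python}
import s3fs
import pyarrow.parquet as pq

# Option 1: pass the full "s3://" URI and let pyarrow create its own
# S3 filesystem from it (no filesystem argument).
table = pq.read_table('s3://bucket/test_pyarrow.parquet',
                      use_pandas_metadata=True)

# Option 2: pass the filesystem object together with a path *without*
# the "s3://" scheme prefix.
filesystem = s3fs.S3FileSystem()
table = pq.read_table('bucket/test_pyarrow.parquet',
                      filesystem=filesystem,
                      use_pandas_metadata=True)
{code}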

Now, the fact that you get an empty table after removing the prefix is strange,
certainly given that listing the files seems to work and shows that there
actually are files in that folder.
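
To narrow that down, a possible diagnostic sketch (again using the path from your report) would be to check what the dataset discovery actually picks up:

{code:python}
import s3fs
import pyarrow.dataset as ds

filesystem = s3fs.S3FileSystem()

# What does s3fs itself list in the dataset directory?
print(filesystem.ls('bucket/test_pyarrow.parquet'))

# Which files does pyarrow's dataset discovery find with hive partitioning,
# and how many rows end up in the resulting table?
dataset = ds.dataset('bucket/test_pyarrow.parquet',
                     filesystem=filesystem, partitioning='hive')
print(dataset.files)
print(dataset.to_table().num_rows)
{code}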

> ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10937
>                 URL: https://issues.apache.org/jira/browse/ARROW-10937
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Vladimir
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Hello
> It looks like pyarrow-2.0.0 cannot read partitioned datasets from S3
> buckets: 
> {code:python}
> import numpy as np
> import pandas as pd
> import s3fs
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> filesystem = s3fs.S3FileSystem()
> 
> d = pd.date_range('1990-01-01', freq='D', periods=10000)
> vals = np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> 
> table = pa.Table.from_pandas(x, preserve_index=True)
> pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet',
>                     partition_cols=['Year'], filesystem=filesystem)
> {code}
>  
>  Now, reading it via pq.read_table:
> {code:python}
> pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem,
>               use_pandas_metadata=True)
> {code}
> Raises exception: 
> {code}
> ArrowInvalid: GetFileInfo() yielded path 
> 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet',
>  which is outside base dir 's3://bucket/test_pyarrow.parquet'
> {code}
>  
> Direct read in pandas:
> {code:python}
> pd.read_parquet('s3://bucket/test_pyarrow.parquet')
> {code}
> returns an empty DataFrame.
>  
> The issue does not exist in pyarrow-1.0.1.


