[jira] [Commented] (ARROW-10937) ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)

Vladimir (Jira) Wed, 16 Dec 2020 13:28:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250647#comment-17250647
 ]


Vladimir commented on ARROW-10937:
----------------------------------

Hello [~jorisvandenbossche], [~apitrou]

On your questions:

1. Calling "fs.get_file_info" returns the full list of files in both 
pyarrow-1.0.1 and 2.0.0:
{code:java}
[<FileInfo for 
'bucket/test_pyarrow.parquet/Year=1990/6583990469864a579b4a7a579b81bec4.parquet':
 type=FileType.File, size=20402>,
 <FileInfo for 
'bucket/test_pyarrow.parquet/Year=1991/03ab50a5f5c9449bb10c8440358c7e35.parquet':
 type=FileType.File, size=20404>,
...{code}
2. Calling pq.read_table without the "s3://" prefix 
("pq.read_table('bucket/test_pyarrow.parquet', filesystem=filesystem, 
use_pandas_metadata=True)") returns an empty pyarrow.Table, also in both 1.0.1 
and 2.0.0.

3. Correction to my original description: reading the partitioned dataset via 
pandas with 
{code:python}
pd.read_parquet('s3://bucket/test_pyarrow.parquet'){code}
returns an empty DataFrame in *both* 1.0.1 and 2.0.0. If I pass the "filesystem" 
argument to pd.read_parquet, then 1.0.1 reads the dataset properly and 2.0.0 
raises the same ArrowInvalid exception.
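
To illustrate the mismatch that the error message points at, here is a minimal sketch using only the standard library. It is not Arrow's actual check: "is_under_base_dir" and "strip_scheme" are hypothetical helpers written for this comment, and the plain string-prefix comparison is an assumption about what "outside base dir" means. It only shows that a listed path without a scheme cannot match a base dir that still carries "s3://".

```python
from urllib.parse import urlsplit


def is_under_base_dir(path: str, base_dir: str) -> bool:
    """Hypothetical check: True if `path` equals `base_dir` or lies below it."""
    base = base_dir.rstrip("/")
    return path == base or path.startswith(base + "/")


def strip_scheme(uri: str) -> str:
    """Hypothetical helper: turn 's3://bucket/key' into 'bucket/key'."""
    parts = urlsplit(uri)
    if not parts.scheme:
        # Already scheme-less, return unchanged.
        return uri
    return parts.netloc + parts.path


# Values taken from the exception text in the issue description.
listed = "bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet"
base_dir = "s3://bucket/test_pyarrow.parquet"

# The listed path has no scheme, so it does not start with the base dir as given.
with_scheme = is_under_base_dir(listed, base_dir)

# Once the scheme is dropped from the base dir, the same path is inside it.
without_scheme = is_under_base_dir(listed, strip_scheme(base_dir))

print(with_scheme, without_scheme)
```

Note that this only explains the string mismatch; as described in point 2, passing the path without "s3://" to pq.read_table still returns an empty table for me, so this is not a workaround.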

 

> ArrowInvalid error on reading partitioned parquet files from S3 (arrow-2.0.0)
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10937
>                 URL: https://issues.apache.org/jira/browse/ARROW-10937
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Vladimir
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Hello
> It looks like pyarrow-2.0.0 cannot read partitioned datasets from S3 
> buckets: 
> {code:java}
> import numpy as np
> import pandas as pd
> import s3fs
> import pyarrow as pa
> import pyarrow.parquet as pq
> filesystem = s3fs.S3FileSystem()
> d = pd.date_range('1990-01-01', freq='D', periods=10000)
> vals = np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> table = pa.Table.from_pandas(x, preserve_index=True)
> pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet', 
> partition_cols=['Year'], filesystem=filesystem)
> {code}
>  
>  Now, reading it via pq.read_table:
> {code:java}
> pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem, 
> use_pandas_metadata=True)
> {code}
> Raises exception: 
> {code:java}
> ArrowInvalid: GetFileInfo() yielded path 
> 'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet',
>  which is outside base dir 's3://bucket/test_pyarrow.parquet'
> {code}
>  
> Direct read in pandas:
> {code:java}
> pd.read_parquet('s3://bucket/test_pyarrow.parquet'){code}
> returns empty DataFrame.
>  
> The issue does not exist in pyarrow-1.0.1



