Vladimir created ARROW-10937:
--------------------------------
Summary: ArrowInvalid error on reading partitioned parquet files
from S3 (arrow-2.0.0)
Key: ARROW-10937
URL: https://issues.apache.org/jira/browse/ARROW-10937
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Reporter: Vladimir
Hello
It looks like pyarrow-2.0.0 cannot read partitioned Parquet datasets from S3 buckets:
{code:python}
import numpy as np
import pandas as pd
import s3fs
import pyarrow as pa
import pyarrow.parquet as pq

filesystem = s3fs.S3FileSystem()

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

table = pa.Table.from_pandas(x, preserve_index=True)
pq.write_to_dataset(table, root_path='s3://bucket/test_pyarrow.parquet',
                    partition_cols=['Year'], filesystem=filesystem)
{code}
Now, reading it via pq.read_table:
{code:python}
pq.read_table('s3://bucket/test_pyarrow.parquet', filesystem=filesystem,
              use_pandas_metadata=True)
{code}
raises the following exception:
{code}
ArrowInvalid: GetFileInfo() yielded path
'bucket/test_pyarrow.parquet/Year=2017/ffcc136787cf46a18e8cc8f72958453f.parquet',
which is outside base dir 's3://bucket/test_pyarrow.parquet'
{code}
A direct read via pandas:
{code:python}
pd.read_parquet('s3://bucket/test_pyarrow.parquet')
{code}
returns an empty DataFrame.
This issue does not occur with pyarrow-1.0.1.
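A possible workaround, sketched below. This is only my assumption and is not verified against real S3: the error message suggests the base dir kept its {{s3://}} scheme while the paths yielded by {{GetFileInfo()}} are scheme-less, so passing the path without the scheme may make the prefix check line up:

{code:python}
# Hypothetical workaround: strip the "s3://" scheme so the base dir matches
# the scheme-less paths that GetFileInfo() appears to return.
path = 's3://bucket/test_pyarrow.parquet'
bare_path = path.split('://', 1)[-1]  # 'bucket/test_pyarrow.parquet'

# Then read with the scheme-less path (untested on a live bucket):
# pq.read_table(bare_path, filesystem=filesystem, use_pandas_metadata=True)
{code}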
--
This message was sent by Atlassian Jira
(v8.3.4#803005)