[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908648#comment-16908648 ]
Wes McKinney commented on ARROW-6058:
-------------------------------------

Thank you, that's great! I added to the 0.15.0 milestone. I've been working a lot on Parquet stuff lately so if no one looks at it first I'll try to look before the release horizon closes

> [Python][Parquet] Failure when reading Parquet file from S3
> ------------------------------------------------------------
>
>                 Key: ARROW-6058
>                 URL: https://issues.apache.org/jira/browse/ARROW-6058
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Siddharth
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
>
> I am reading parquet data from S3 and get ArrowIOError error.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
>   return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
>   use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
>   table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
>   use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: Same code works on relatively smaller dataset (approx < 50M records)*
>
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)