Siddharth created ARROW-6058:
--------------------------------

             Summary: pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller than expected
                 Key: ARROW-6058
                 URL: https://issues.apache.org/jira/browse/ARROW-6058
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.14.1
            Reporter: Siddharth

I am reading Parquet data from S3 and get an ArrowIOError.

Size of the data: 32 part files of about 90 MB each (approx. 3 GB in total)
Number of records: approx. 100M

Code Snippet:
```
from s3fs import S3FileSystem
import pyarrow.parquet as pq

s3 = S3FileSystem()
dataset = pq.ParquetDataset("s3://location", filesystem=s3)
df = dataset.read_pandas().to_pandas()
```

Stack Trace:
```
    df = dataset.read_pandas().to_pandas()
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
    return self.read(use_pandas_metadata=True, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
    table = reader.read(**options)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
```
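
One way to narrow this down, sketched here under assumptions rather than taken from the report: read each part file on its own and record which ones raise, to see whether the error is tied to specific files or only appears when the whole dataset is read at once. The names `find_unreadable_parts` and `read_part_from_s3` are hypothetical helpers introduced for illustration, and nothing here is claimed to fix the underlying problem.

```python
from typing import Callable, Iterable, List, Tuple


def find_unreadable_parts(
    paths: Iterable[str],
    read_one: Callable[[str], object],
) -> List[Tuple[str, str]]:
    """Try to read each part file separately and collect the ones that fail.

    Returns a list of (path, "ErrorType: message") pairs, one per failure.
    """
    failures = []
    for path in paths:
        try:
            read_one(path)
        except Exception as exc:  # deliberately broad: record any read error
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures


def read_part_from_s3(path: str) -> object:
    """Hypothetical reader: load a single Parquet part file from S3."""
    # Imported lazily so find_unreadable_parts can be used without
    # pyarrow or s3fs installed (for example with a stub reader).
    import pyarrow.parquet as pq
    from s3fs import S3FileSystem

    s3 = S3FileSystem()
    with s3.open(path, "rb") as f:
        return pq.read_table(f)
```

It could be called as `find_unreadable_parts(S3FileSystem().glob("s3://location/*.parquet"), read_part_from_s3)`, where the glob pattern is only a guess at how the part files are laid out. An empty result would suggest that each file is readable in isolation and that the failure depends on how the full dataset is read.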