Siddharth created ARROW-6058:
--------------------------------
Summary: pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller than expected
Key: ARROW-6058
URL: https://issues.apache.org/jira/browse/ARROW-6058
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.14.1
Reporter: Siddharth
I am reading Parquet data from S3 and get an ArrowIOError.
Size of the data: 32 part files of about 90 MB each (approx. 3 GB in total)
Number of records: approx. 100M
Code Snippet:
```
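from s3fs import S3FileSystem
import pyarrow.parquet as pq

# Open the dataset directory on S3 through s3fs
s3 = S3FileSystem()
dataset = pq.ParquetDataset("s3://location", filesystem=s3)
# Read all part files into a pandas DataFrame (this is the call that raises)
df = dataset.read_pandas().to_pandas()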
```
Stack Trace:
```
    df = dataset.read_pandas().to_pandas()
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
    return self.read(use_pandas_metadata=True, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
    table = reader.read(**options)
  File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
```
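Possible way to narrow this down (a sketch, not a confirmed diagnosis): read each part file on its own and record which ones fail, to see whether the error is tied to specific files or only shows up when the whole dataset is read at once. The helper names `find_unreadable_files`, `read_parquet_from_s3` and `report_unreadable_files` are hypothetical, and the `.parquet` suffix filter is an assumption about how the part files are named.
```
from typing import Callable, Dict, Iterable


def find_unreadable_files(
    paths: Iterable[str],
    read_one: Callable[[str], object],
) -> Dict[str, str]:
    """Try to read each file separately; return {path: error text} for failures."""
    failures = {}
    for path in paths:
        try:
            read_one(path)
        except Exception as exc:  # broad on purpose: diagnostic, report every failure
            failures[path] = "{}: {}".format(type(exc).__name__, exc)
    return failures


def read_parquet_from_s3(path: str):
    """Read a single Parquet file through s3fs."""
    # Imported here so find_unreadable_files itself has no third-party dependencies.
    import pyarrow.parquet as pq
    from s3fs import S3FileSystem

    s3 = S3FileSystem()
    with s3.open(path, "rb") as f:
        return pq.read_table(f)


def report_unreadable_files(location: str) -> Dict[str, str]:
    """List the part files under `location` and print the ones that fail to read."""
    from s3fs import S3FileSystem

    # Assumption: part files end in ".parquet"; adjust the filter if they do not.
    part_files = [p for p in S3FileSystem().ls(location) if p.endswith(".parquet")]
    failures = find_unreadable_files(part_files, read_parquet_from_s3)
    for path, error in failures.items():
        print(path, "->", error)
    return failures
```
Calling `report_unreadable_files("s3://location")` with the same placeholder location as in the snippet would list any files that fail individually. If every file reads cleanly on its own, the problem is more likely related to reading the full dataset in one call than to the contents of any single file.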