[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904082#comment-16904082 ]
Andrey Krivonogov commented on ARROW-6058:
------------------------------------------

Hi [~wesmckinn],

I have experienced the same issue as [~sid88in], and I managed to reproduce it with synthetic data:
{code:java}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

table = pa.Table.from_arrays([pa.array(np.arange(3 * 10 ** 7), type=pa.int64())], ['col'])
path = 's3://bucket/path/0.parquet'
fs = s3fs.S3FileSystem()
pq.write_table(table, path, filesystem=fs, row_group_size=10 ** 7)
table_read = pq.read_table(path, filesystem=fs)
{code}
This snippet raises a similar error:
{code:java}
ArrowIOError: Unexpected end of stream: Page was smaller (241959) than expected (524605)
{code}
The problem seems to depend on the s3fs version. The package versions I have are:
{code:java}
python 3.6.7
packages installed with conda (via conda-forge)
boto3==1.9.204
botocore==1.12.204
numpy==1.16.2
pyarrow==0.14.1
{code}
The error is raised with
{code:java}
s3fs==0.3.3
{code}
but everything works fine with
{code:java}
s3fs==0.2.2
{code}
Thank you in advance for your help!

> [Python][Parquet] Failure when reading Parquet file from S3
> ------------------------------------------------------------
>
>                 Key: ARROW-6058
>                 URL: https://issues.apache.org/jira/browse/ARROW-6058
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Siddharth
>            Priority: Major
>              Labels: parquet
>
> I am reading parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
>     return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
>     use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
>     table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
>     use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: Same code works on relatively smaller dataset (approx < 50M records)*
>
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
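
One way to narrow down whether the bytes are already incomplete at the filesystem layer, before any Parquet decoding happens, is to read the object sequentially and look for a read that returns fewer bytes than requested. The sketch below is only an illustration under that assumption: `first_short_read` is a hypothetical helper, not part of pyarrow or s3fs, and nothing in this thread establishes that short reads are the actual cause. It accepts any buffered file-like object, so it is demonstrated here on in-memory buffers.

```python
import io


def first_short_read(fileobj, expected_size, chunk_size=1 << 20):
    """Return the byte position where data first came up short, or None.

    Hypothetical diagnostic helper: reads `fileobj` sequentially in chunks
    and reports the position at which a read returned fewer bytes than
    requested, although `expected_size` says more bytes should exist.
    """
    offset = 0
    while offset < expected_size:
        want = min(chunk_size, expected_size - offset)
        got = len(fileobj.read(want))
        if got < want:
            # Fewer bytes than requested before reaching expected_size.
            return offset + got
        offset += got
    return None


# A buffer that really holds 10 bytes reads back completely.
complete = io.BytesIO(b"x" * 10)
print(first_short_read(complete, expected_size=10, chunk_size=4))

# A buffer that holds only 6 of the 10 expected bytes runs dry at position 6.
truncated = io.BytesIO(b"x" * 6)
print(first_short_read(truncated, expected_size=10, chunk_size=4))
```

Against S3, the same function could be given the file object from `fs.open(path, 'rb')` together with the object size reported by S3, once with s3fs 0.3.3 and once with s3fs 0.2.2, to see whether the two versions return the same number of bytes.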