[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908622#comment-16908622 ]
Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:03 AM:
----------------------------------------------------------------

Hi all, below is a simple piece of code to reproduce the issue using:
{code:java}
s3fs==0.3.3
pyarrow==0.14.1
pandas==0.24.0
{code}
The generated file is roughly 170 MB.
{code:java}
>>> import pandas as pd
>>> import numpy as np
>>> pd.DataFrame(np.random.randint(0, 10000, (10000000, 10)), columns=[str(i) for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet')
>>> pd.read_parquet('s3://path/to/file.snappy.parquet')
{code}
{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599)
{code}

> [Python][Parquet] Failure when reading Parquet file from S3
> ------------------------------------------------------------
>
>                 Key: ARROW-6058
>                 URL: https://issues.apache.org/jira/browse/ARROW-6058
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Siddharth
>            Priority: Major
>              Labels: parquet
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files 90 MB each (3GB approx)
> Number of records: Approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: Same code works on relatively smaller dataset (approx < 50M records)*
>
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)