[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

Wong Chung Hoi (JIRA) Thu, 15 Aug 2019 19:04:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908622#comment-16908622
 ]


Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:03 AM:
----------------------------------------------------------------

Hi all,

Below is a minimal script that reproduces the issue with the following package versions:

 
{code:java}
s3fs==0.3.3
pyarrow==0.14.1
pandas==0.24.0 
{code}
 

The generated file is roughly 170 MB.

 
{code:java}
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0, 10000, (10000000, 10)),
...                   columns=[str(i) for i in range(10)])
>>> # write 10M rows x 10 integer columns to S3, then read the file back
>>> df.to_parquet('s3://path/to/file.snappy.parquet')
>>> pd.read_parquet('s3://path/to/file.snappy.parquet')
{code}
{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599)
{code}
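
The numbers in the error (304272 bytes received, 979599 expected) look like what one would see if a single read() call on the remote file returned fewer bytes than were requested. That is only a guess; nothing in this report confirms it. Below is a minimal, standard-library-only sketch of the idea. ShortReadFile and read_exact are hypothetical names made up for illustration; they are not part of s3fs or pyarrow, and the 304272-byte limit is simply borrowed from the error message.

```python
import io


def read_exact(f, nbytes):
    """Read exactly nbytes from f, looping over short reads.

    Raises EOFError if the stream ends before nbytes were read.
    """
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = f.read(remaining)
        if not chunk:  # empty result means the stream really ended
            raise EOFError(
                "wanted %d bytes, got %d" % (nbytes, nbytes - remaining))
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ShortReadFile(io.BytesIO):
    """Hypothetical stand-in for a remote file (assumption, not s3fs):
    never returns more than `limit` bytes from one read() call."""

    def __init__(self, data, limit):
        super().__init__(data)
        self._limit = limit

    def read(self, size=-1):
        # Cap every request at the per-call limit.
        if size is None or size < 0 or size > self._limit:
            size = self._limit
        return super().read(size)


payload = bytes(range(256)) * 4000          # 1,024,000 bytes of dummy data
f = ShortReadFile(payload, limit=304272)

single = f.read(979599)                     # one call: comes back short
f.seek(0)
full = read_exact(f, 979599)                # looped reads: complete

print(len(single), len(full))
```

If a consumer treats the result of the single read() as the whole page, it ends up with less data than the page header announced, which would match the shape of the message above. Whether that is what happens here still needs to be checked against the real S3 file object.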
 



> [Python][Parquet] Failure when reading Parquet file from S3 
> ------------------------------------------------------------
>
>                 Key: ARROW-6058
>                 URL: https://issues.apache.org/jira/browse/ARROW-6058
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Siddharth
>            Priority: Major
>              Labels: parquet
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files of 90 MB each (approx. 3 GB)
> Number of records: approx. 100M
> Code snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
>     df = dataset.read_pandas().to_pandas()
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
>     return self.read(use_pandas_metadata=True, **kwargs)
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
>     table = reader.read(**options)
>   File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>  
> *Note: The same code works on a relatively smaller dataset (fewer than approx. 50M records)*
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
