[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3

Wes McKinney (JIRA) Thu, 15 Aug 2019 07:29:25 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908138#comment-16908138
 ]


Wes McKinney commented on ARROW-6058:
-------------------------------------

So far we don't have a minimal reproduction of the issue, so it's very hard 
for other developers in this project to help. Since you are the one 
encountering the problem, you are best positioned to reproduce the issue or 
determine the root cause. 
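
One possible way to narrow the problem down, sketched here as an illustration 
rather than as a confirmed diagnosis: read each part file on its own and 
record which ones raise. The helper names (find_failing_parts, 
read_part_from_s3) and the glob pattern in the note below are hypothetical and 
not part of the original report; "location" is the placeholder used in the 
report's snippet.

```python
def find_failing_parts(paths, read_part):
    """Call read_part on every path and collect the paths that raise.

    Returns a list of (path, error_repr) tuples, one per failing file.
    """
    failures = []
    for path in paths:
        try:
            read_part(path)
        except Exception as exc:
            # Broad on purpose: the goal is only to localize the failure
            # to specific files, not to handle it.
            failures.append((path, repr(exc)))
    return failures


def read_part_from_s3(path):
    """Read a single Parquet file from S3 (hypothetical helper).

    Imports are local so find_failing_parts can be exercised without
    S3 access or pyarrow installed.
    """
    from s3fs import S3FileSystem
    import pyarrow.parquet as pq

    s3 = S3FileSystem()
    with s3.open(path, "rb") as f:
        return pq.read_table(f)
```

Assuming the part files can be listed with something like 
S3FileSystem().glob("location/*.parquet"), passing that list together with 
read_part_from_s3 to find_failing_parts would show whether the error is tied 
to particular files. A single failing file would be a much smaller 
reproduction to share than the full dataset.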

> [Python][Parquet] Failure when reading Parquet file from S3 
> ------------------------------------------------------------
>
>                 Key: ARROW-6058
>                 URL: https://issues.apache.org/jira/browse/ARROW-6058
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.1
>            Reporter: Siddharth
>            Priority: Major
>              Labels: parquet
>
> I am reading Parquet data from S3 and get an ArrowIOError.
> Size of the data: 32 part files, 90 MB each (approx. 3 GB)
> Number of records: approx. 100M
> Code Snippet:
> {code:python}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>  
> *Note: The same code works on a smaller dataset (approx. < 50M records)* 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
