[
https://issues.apache.org/jira/browse/ARROW-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183268#comment-17183268
]
Joris Van den Bossche commented on ARROW-9827:
----------------------------------------------
As I suspected, this is due to the parsing of the metadata, and especially the
column statistics. In the Dataset API, we parse all metadata by default,
including statistics, even when no filter is specified.
With statistics parsing disabled, the file with 40,000 columns reads in
2.5 seconds (I will open a PR with that change).
Furthermore, with the released version of pyarrow, using ParquetFile is
currently the fastest option:
{code}
In [7]: %time table = pq.ParquetFile("test.parquet").read()
CPU times: user 7.89 s, sys: 253 ms, total: 8.14 s
Wall time: 4.5 s
{code}
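For reference, a minimal timing sketch of the two read paths (my own, not from
the report; it assumes the test.parquet file produced by the repro script
quoted below). In pyarrow 1.0, pq.read_table goes through the new Dataset API,
while ParquetFile reads the single file directly:
{code:python}
import time

import pyarrow.parquet as pq

def timed(label, fn):
    # crude wall-clock timing, analogous to %time in the session above
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f} s")
    return result

# Dataset-API path: parses all file metadata, including column statistics
timed("pq.read_table", lambda: pq.read_table("test.parquet"))
# Direct single-file reader: the currently fastest option mentioned above
timed("pq.ParquetFile(...).read()", lambda: pq.ParquetFile("test.parquet").read())
{code}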
Additional sidenote: the Parquet file format is generally not well suited for
such wide tables because of its column-heavy metadata. For example, a Parquet
file with the same number of values but in long format (40,000 rows x 10
columns) reads in less than 10 ms instead of multiple seconds.
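A sketch of that wide-vs-long comparison (the file names and zero-filled data
are my own, mirroring the repro below): both files hold the same 400,000
values, only the shape differs.
{code:python}
import time

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# 10 rows x 40,000 columns (wide) vs 40,000 rows x 10 columns (long)
wide = pd.DataFrame(np.zeros((10, 40_000)),
                    columns=[f"A_{i}" for i in range(40_000)])
long_df = pd.DataFrame(np.zeros((40_000, 10)),
                       columns=[f"A_{i}" for i in range(10)])
wide.to_parquet("wide.parquet")
long_df.to_parquet("long.parquet")

for path in ("wide.parquet", "long.parquet"):
    start = time.perf_counter()
    pq.read_table(path)
    print(f"{path}: read in {time.perf_counter() - start:.3f} s")
{code}
The gap comes from the per-column metadata (schema, statistics), which grows
with the number of columns rather than the number of values.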
> [Python] pandas.read_parquet fails for wide parquet files and pyarrow 1.0.X
> ---------------------------------------------------------------------------
>
> Key: ARROW-9827
> URL: https://issues.apache.org/jira/browse/ARROW-9827
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0
> Reporter: Kyle Beauchamp
> Priority: Major
>
> I recently updated pyarrow from 0.17.1 to 1.0.0 and I'm encountering a
> serious bug where wide DataFrames fail during pandas.read_parquet. Small
> parquet files (m=10000) read correctly, medium files (m=40000) fail with a
> "Bus error: 10", and large files (m=100000) hang completely. This is with
> Python 3.8.5, pandas 1.0.5, pyarrow 1.0.0, and OSX 10.14.
> The driver code and output are below:
> {code:python}
> import pandas as pd
> import numpy as np
> import sys
> filename = "test.parquet"
> n = 10
> m = int(sys.argv[1])
> print(m)
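> # n rows x m columns of zeros: few rows, very many columns (a "wide" table)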
> x = np.zeros((n, m))
> x = pd.DataFrame(x, columns=[f"A_{i}" for i in range(m)])
> x.to_parquet(filename)
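> # reading the wide file back is the step that fails on pyarrow 1.0.0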
> y = pd.read_parquet(filename, engine='pyarrow')
> {code}
> {code:bash}
> time python test_pyarrow.py 10000
> real    0m4.018s
> user    0m5.286s
> sys     0m0.514s
> time python test_pyarrow.py 40000
> 40000
> Bus error: 10
> {code}
>
> In a pyarrow 0.17.1 environment, the 40,000-column case completes in 8 seconds.
> This was cross-posted on the pandas tracker as well:
> [https://github.com/pandas-dev/pandas/issues/35846]