[https://issues.apache.org/jira/browse/ARROW-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183184#comment-17183184]

Joris Van den Bossche commented on ARROW-9827:
----------------------------------------------

[~kyleabeauchamp] thanks for moving the issue here.

In pyarrow 1.0, we switched the default for {{pyarrow.parquet.read_table}} 
(which is what pandas uses under the hood) to the new Dataset API (which, e.g., 
enables filtering on row groups; see the sketch below). When reading the file 
with 40,000 columns, I get a similar result with 1.0 as you got with 0.17 if I 
disable that:

{code}
In [3]: import pyarrow.parquet as pq

In [4]: %time table = pq.read_table("test.parquet", use_legacy_dataset=True)
CPU times: user 11.4 s, sys: 296 ms, total: 11.7 s
Wall time: 7.98 s
{code}
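
(As an aside, since I mentioned row-group filtering: here is a minimal sketch of 
what the new Dataset API enables, assuming a file where the new code path works; 
the column name follows your repro script, and the threshold is made up for 
illustration:)

{code:python}
import pyarrow.parquet as pq

# With the new Dataset API (the pyarrow 1.0 default), filters can be
# applied at the row-group level, so non-matching row groups are skipped.
table = pq.read_table("test.parquet", filters=[("A_0", ">=", 0.0)])
{code}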

But without specifying {{use_legacy_dataset=True}} (i.e. with the default of 
{{use_legacy_dataset=False}}), it "completed" after more than 4 minutes and 
then immediately segfaulted upon return, in my case.

So specifying this keyword is certainly a temporary workaround that lets you 
upgrade to 1.0. You can pass it from the pandas interface as well (it gets 
forwarded to pyarrow): {{pd.read_parquet("test.parquet", 
use_legacy_dataset=True)}}
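
For completeness, a minimal sketch of the workaround at both API levels 
(pyarrow 1.0.x; the extra pandas keyword is simply forwarded to pyarrow):

{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Force the pre-1.0 reader instead of the new Dataset API
table = pq.read_table("test.parquet", use_legacy_dataset=True)

# The same keyword, passed through the pandas interface
df = pd.read_parquet("test.parquet", engine="pyarrow",
                     use_legacy_dataset=True)
{code}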

> [Python] pandas.read_parquet fails for wide parquet files and pyarrow 1.0.X
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-9827
>                 URL: https://issues.apache.org/jira/browse/ARROW-9827
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>            Reporter: Kyle Beauchamp
>            Priority: Major
>
> I recently tried to update my pyarrow from 0.17.1 to 1.0.0 and I'm 
> encountering a serious bug where wide DataFrames fail during 
> pandas.read_parquet.  Small parquet files (m=10000) read correctly, medium 
> files (m=40000) fail with a "Bus Error: 10", and large files (m=100000) 
> hang completely.  I've tried python 3.8.5, pandas 1.0.5, and pyarrow 1.0.0 
> on OSX 10.14.   
> The driver code and output are below:
> {code:python}
> import pandas as pd
> import numpy as np
> import sys
>
> filename = "test.parquet"
> n = 10                # number of rows
> m = int(sys.argv[1])  # number of columns, taken from the command line
> print(m)
>
> x = np.zeros((n, m))
> x = pd.DataFrame(x, columns=[f"A_{i}" for i in range(m)])
> x.to_parquet(filename)
> y = pd.read_parquet(filename, engine='pyarrow')
> {code}
> {code:bash}
> time python test_pyarrow.py  10000
> 10000
> real 0m4.018s
> user 0m5.286s
> sys 0m0.514s
>
> time python test_pyarrow.py  40000
> 40000
> Bus error: 10
> {code}
>  
> In a pyarrow 0.17.1 environment, the 40,000-column case completes in 8 seconds.  
> This was cross-posted on the pandas tracker as well: 
> [https://github.com/pandas-dev/pandas/issues/35846]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
