[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-6876:
-----------------------------------------
    Summary: [Python] Reading parquet file with many columns becomes slow for 0.15.0  (was: [Python] Reading parquet file becomes really slow for 0.15.0)

> [Python] Reading parquet file with many columns becomes slow for 0.15.0
> -----------------------------------------------------------------------
>
>                 Key: ARROW-6876
>                 URL: https://issues.apache.org/jira/browse/ARROW-6876
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.0
>         Environment: python3.7
>            Reporter: Bob
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0, 0.15.1
>
>         Attachments: image-2019-10-14-18-10-42-850.png, image-2019-10-14-18-12-07-652.png
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hi,
>
> I just noticed that reading a parquet file with pandas became really slow
> after I upgraded to 0.15.0.
>
> Example:
> *With 0.14.1*
> In [4]: %timeit df = pd.read_parquet(path)
> 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
> In [5]: %timeit df = pd.read_parquet(path)
> 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> The file is about 15MB in size. I am testing on the same machine, using the
> same versions of python and pandas.
>
> Have you received similar complaints? What could be the issue here?
>
> Thanks a lot.
>
>
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)