[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951175#comment-16951175 ]
Joris Van den Bossche commented on ARROW-6876:
----------------------------------------------
Thanks. If it is just floats, I'll try to reproduce it based on that description.
It's probably related to the fact that you have a very wide dataframe (n
columns >> n rows). In general, Parquet is not well suited to that kind of
data (even in 0.14, the 2 seconds it takes to read is very slow). That said,
it's still a performance regression compared to 0.14 that is worth looking into.
> Reading parquet file becomes really slow for 0.15.0
> ---------------------------------------------------
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Environment: python3.7
> Reporter: Bob
> Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png,
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>
> I just noticed that reading a parquet file with pandas became really slow
> after I upgraded to 0.15.0.
>
> Example:
> *With 0.14.1*
> In [4]: %timeit df = pd.read_parquet(path)
> 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
> In [5]: %timeit df = pd.read_parquet(path)
> 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> The file is about 15MB in size. I am testing on the same machine using the
> same version of python and pandas.
>
> Have you received similar complaints? What could be the issue here?
>
> Thanks a lot.
>
>
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>