[
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344597#comment-17344597
]
Joris Van den Bossche commented on ARROW-11469:
-----------------------------------------------
Thanks for those analyses!
Something else I am wondering: in this specific case, there is actually no
projection to be done. Would it be worth also adding a special case for this,
assuming that checking exact schema equality is faster than reprojecting the
batch to the same schema (although for many columns, checking schema equality
might also be slow)?
> [Python] Performance degradation parquet reading of wide dataframes
> -------------------------------------------------------------------
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: image-2021-05-03-14-31-41-260.png,
> image-2021-05-03-14-39-59-485.png, image-2021-05-03-14-40-09-520.png,
> profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 10000))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet")
> {code}
> In version 0.17.0, this takes about 300-400 ms; for 1.0.0 and later, it
> suddenly takes around 2 seconds.
>
> Thanks for looking into this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)