[
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344597#comment-17344597
]
Joris Van den Bossche commented on ARROW-11469:
-----------------------------------------------
Thanks for those analyses!
Something else I am wondering: in this specific case, there is actually no
projection to be done. Would it be worth also adding a special case for this,
assuming that checking exact schema equality is faster than reprojecting the
batch to the same schema (although for many columns, checking schema equality
might also be slow)?
> [Python] Performance degradation parquet reading of wide dataframes
> -------------------------------------------------------------------
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: image-2021-05-03-14-31-41-260.png,
> image-2021-05-03-14-39-59-485.png, image-2021-05-03-14-40-09-520.png,
> profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example you should be able to reproduce by doing:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 10000))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet")
> {code}
> In version 0.17.0, this takes about 300-400 ms; for 1.0.0 and later, it
> suddenly takes around 2 seconds.
>
> Thanks for looking into this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)