[
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344604#comment-17344604
]
David Li commented on ARROW-11469:
----------------------------------
I also thought about that and am of two minds. This case (no projection, or
perhaps only column selection) is presumably a very common case and could
warrant some special treatment to ensure it's consistently fast, and of course
doing no work is much better than doing work quickly. But if some relatively
simple optimizations would help in all cases, then I think that's worth
pursuing over a special case.
Maybe it'd be worth benchmarking to ensure the optimizations here give enough of
a speedup and don't slow down other cases (narrow schemas, selecting only a few
columns, actual projections, etc.). That would both warn us about potential
future regressions and help us decide whether the special case is worth it.
As for schema equality, if we do special-case things: as with the optimizations
described here, if we can assume that all batches within a fragment share the
same schema, then checking schemas becomes only O(fragments) rather than
O(batches), which should reduce the overhead considerably (and the checks could
be pipelined with other work).
> [Python] Performance degradation parquet reading of wide dataframes
> -------------------------------------------------------------------
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: image-2021-05-03-14-31-41-260.png,
> image-2021-05-03-14-39-59-485.png, image-2021-05-03-14-40-09-520.png,
> profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example, you should be able to reproduce it by doing:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 10000))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms; for anything at or above
> 1.0.0, it suddenly takes around 2 seconds.
>
> Thanks for looking into this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)