[
https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344604#comment-17344604
]
David Li commented on ARROW-11469:
----------------------------------
I also thought about that and am of two minds. This case (no projection, or
perhaps only column selection) is presumably a very common case and could
warrant some special treatment to ensure it's consistently fast, and of course
doing no work is much better than doing work quickly. But if some relatively
simple optimizations would help in all cases, then I think that's worth
pursuing over a special case.
Maybe it'd be worth benchmarking to ensure the optimizations here give enough of
a speedup and don't slow down other cases (narrow schemas, selecting only a few
columns, actual projections, etc.). That would both warn us about potential
future regressions and help us decide whether the special case is worth it.
As for schema equality, if we do special-case things: as with the optimizations
described here, if we can assume that all batches within a fragment share the
same schema, then checking schemas becomes only O(fragments) rather than
O(batches), which should reduce the overhead considerably (and the checks could
be pipelined with other work).
> [Python] Performance degradation parquet reading of wide dataframes
> -------------------------------------------------------------------
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: image-2021-05-03-14-31-41-260.png,
> image-2021-05-03-14-39-59-485.png, image-2021-05-03-14-40-09-520.png,
> profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example, you should be able to reproduce it by doing:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 10000))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet"){code}
> In version 0.17.0, this takes about 300-400 ms; for anything at or above
> 1.0.0, it suddenly takes around 2 seconds.
>
> Thanks for looking into this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)