[https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343494#comment-17343494]
David Li commented on ARROW-11469:
----------------------------------
Finally, passing a FieldRef-to-FieldPath cache to Bind() knocks out most of
the rest of the slowdown, with the final result being 0.6s for a scan using
Arrow 0.17 and 0.8s for the patched version here. (Note that master with
use_legacy_dataset=True also takes about 0.8s, so the remaining difference may
be in the Parquet reader itself.) Profile: [https://share.firefox.dev/33CGu3N]
[~bkietz] what do you think about optionally passing an
{{unordered_map<FieldRef, FieldPath>}} cache to {{ExecuteScalarExpression}} and
{{Expression::Bind}}? The invariant would be that the cache is never used with
a schema that has the same fields but in a different order. I think we can
maintain this easily enough, so long as we assume that all batches from the
same fragment share the same schema: we construct the cache when we start
scanning a given fragment and isolate it to that fragment only. We would then
visit the schema once to build the cache instead of re-walking it on every
FieldRef lookup. A toy sketch of the idea is below.
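Roughly, the per-fragment cache amounts to the following. This is only a toy
sketch with stand-in types (a FieldRef reduced to a plain column name, a
FieldPath to a column index); the real {{arrow::FieldRef}} /
{{arrow::FieldPath}} types and the {{Bind}} signature differ:
{code:cpp}
// Toy sketch (not Arrow's actual API): memoize FieldRef -> FieldPath
// resolution per fragment, so the schema is visited once rather than
// on every field lookup.
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-ins for arrow::FieldRef / arrow::FieldPath, flattened to a
// name -> column index mapping for illustration.
using FieldRef = std::string;
using FieldPath = int;

class FragmentFieldCache {
 public:
  // Visit the fragment's schema once, up front.
  explicit FragmentFieldCache(const std::vector<std::string>& schema_fields) {
    for (int i = 0; i < static_cast<int>(schema_fields.size()); ++i) {
      paths_.emplace(schema_fields[i], i);
    }
  }

  // O(1) lookup instead of an O(n_fields) scan per reference.
  std::optional<FieldPath> Find(const FieldRef& ref) const {
    auto it = paths_.find(ref);
    if (it == paths_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::unordered_map<FieldRef, FieldPath> paths_;
};

int main() {
  // One cache per fragment; valid only while batches share this schema
  // (same fields, same order), per the invariant above.
  FragmentFieldCache cache({"a", "b", "c"});
  assert(cache.Find("b") == FieldPath{1});
  assert(!cache.Find("z").has_value());
  return 0;
}
{code}
The design point is simply that the cache lives and dies with the fragment,
which is what maintains the same-schema invariant.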
(Also note that filtering is untested here and may benefit from this
optimization too; I didn't try the async scanner.)
> [Python] Performance degradation parquet reading of wide dataframes
> -------------------------------------------------------------------
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: image-2021-05-03-14-31-41-260.png,
> image-2021-05-03-14-39-59-485.png, image-2021-05-03-14-40-09-520.png,
> profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example, you should be able to reproduce it by doing:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 10000))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet")
> {code}
> In version 0.17.0, this takes about 300-400 ms; for anything at or above
> 1.0.0, it suddenly takes around 2 seconds.
>
> Thanks for looking into this.