[https://issues.apache.org/jira/browse/ARROW-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343494#comment-17343494]
David Li commented on ARROW-11469:
----------------------------------
Finally, passing a FieldRef-to-FieldPath cache to Bind() knocks out most of
the rest of the slowdown, with the final result being 0.6s for a scan using
Arrow 0.17 and 0.8s for the patched version here. (Note that master with
use_legacy_dataset=True also takes about 0.8s, so the remaining difference may
be in the Parquet reader itself.) Profile: [https://share.firefox.dev/33CGu3N]
[~bkietz] what do you think about optionally passing an
{{unordered_map<FieldRef, FieldPath>}} cache to {{ExecuteScalarExpression}} and
{{Expression::Bind}}? The invariant would be that the cache is never used with
a schema that has the same fields but in a different order. I think we can
maintain this easily enough, so long as we assume that all batches from the
same fragment share the same schema: we construct the cache when we start
scanning a given fragment and isolate it to that fragment only. We would then
visit the schema once to build the cache instead of re-walking it on every
FieldRef lookup. A toy sketch of the idea is below.
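Roughly, the per-fragment cache amounts to the following. This is only a toy
sketch with stand-in types (a FieldRef reduced to a plain column name, a
FieldPath to a column index); the real {{arrow::FieldRef}} /
{{arrow::FieldPath}} types and the {{Bind}} signature differ:
{code:cpp}
// Toy sketch (not Arrow's actual API): memoize FieldRef -> FieldPath
// resolution per fragment, so the schema is visited once rather than
// on every field lookup.
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-ins for arrow::FieldRef / arrow::FieldPath, flattened to a
// name -> column index mapping for illustration.
using FieldRef = std::string;
using FieldPath = int;

class FragmentFieldCache {
 public:
  // Visit the fragment's schema once, up front.
  explicit FragmentFieldCache(const std::vector<std::string>& schema_fields) {
    for (int i = 0; i < static_cast<int>(schema_fields.size()); ++i) {
      paths_.emplace(schema_fields[i], i);
    }
  }

  // O(1) lookup instead of an O(n_fields) scan per reference.
  std::optional<FieldPath> Find(const FieldRef& ref) const {
    auto it = paths_.find(ref);
    if (it == paths_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::unordered_map<FieldRef, FieldPath> paths_;
};

int main() {
  // One cache per fragment; valid only while batches share this schema
  // (same fields, same order), per the invariant above.
  FragmentFieldCache cache({"a", "b", "c"});
  assert(cache.Find("b") == FieldPath{1});
  assert(!cache.Find("z").has_value());
  return 0;
}
{code}
The design point is simply that the cache lives and dies with the fragment,
which is what maintains the same-schema invariant.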
(Also note that filtering is untested here and may benefit from this
optimization too; I didn't try the async scanner.)
> [Python] Performance degradation parquet reading of wide dataframes
> -------------------------------------------------------------------
>
> Key: ARROW-11469
> URL: https://issues.apache.org/jira/browse/ARROW-11469
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0, 1.0.1, 2.0.0, 3.0.0
> Reporter: Axel G
> Priority: Minor
> Attachments: image-2021-05-03-14-31-41-260.png,
> image-2021-05-03-14-39-59-485.png, image-2021-05-03-14-40-09-520.png,
> profile_wide300.svg
>
>
> I noticed a relatively big performance degradation in version 1.0.0+ when
> trying to load wide dataframes.
> For example, you should be able to reproduce it by doing:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(np.random.rand(100, 10000))
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "temp.parquet")
> %timeit pd.read_parquet("temp.parquet")
> {code}
> In version 0.17.0, this takes about 300-400 ms; for anything at or above
> 1.0.0, it suddenly takes around 2 seconds.
>
> Thanks for looking into this.