[
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195633#comment-17195633
]
Ben Kietzman commented on ARROW-9924:
-------------------------------------
{quote}
Looking at the top of the hierarchical perf report for the "new" code, the
deeply nested layers of iterators strikes me as one thing to think more about
whether that's the design we want
{quote}
To be clear, is the concern about clarity or performance? IIUC,
[https://gist.github.com/wesm/3e3eeb6b7f5f22650f18e69e206c2eb8#file-gistfile1-txt-L8-L20]
shows that the cost is minimal: only 0.65% of runtime was spent managing the
Iterator abstraction. If we wanted to replace our abstraction for lazy
sequences, we could potentially refactor to {{Future<T>}}-based iteration. Did
you have a replacement in mind?
{quote}
why ProjectRecordBatch and FilterRecordBatch being used? Nothing is being
projected nor filtered
{quote}
We don't explicitly elide them when the projection or filter is trivial. I
could try to benchmark whether there is a significant performance benefit to
adding a special case for trivial projection/filtering, but I'd guess we don't
gain anything.
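As a rough illustration of what such a special case could look like, here is a minimal pure-Python sketch. It is not Arrow's scanner code: {{scan}}, {{project}}, {{filter_rows}}, and the dict-of-lists batch type are hypothetical stand-ins, and {{None}} is assumed to mean "trivial". The point is only that a trivial projection or filter adds no layer to the iterator chain.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

# Hypothetical stand-in for a record batch: column name -> list of values.
Batch = Dict[str, List[float]]
Row = Dict[str, float]


def project(batches: Iterable[Batch], columns: List[str]) -> Iterator[Batch]:
    """Lazily keep only the requested columns of each batch."""
    for batch in batches:
        yield {name: batch[name] for name in columns}


def filter_rows(batches: Iterable[Batch],
                predicate: Callable[[Row], bool]) -> Iterator[Batch]:
    """Lazily drop the rows for which the predicate is false."""
    for batch in batches:
        names = list(batch)
        rows = zip(*(batch[name] for name in names))
        kept = [row for row in rows if predicate(dict(zip(names, row)))]
        yield {name: [row[i] for row in kept] for i, name in enumerate(names)}


def scan(batches: Iterable[Batch],
         columns: Optional[List[str]] = None,
         predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Batch]:
    """Build the lazy pipeline, eliding trivial steps.

    Assumption: None means "no projection" / "no filter", so the
    corresponding wrapper is skipped instead of being applied as a no-op.
    """
    it: Iterator[Batch] = iter(batches)
    if predicate is not None:
        # Filter first so the predicate may use columns that are projected away.
        it = filter_rows(it, predicate)
    if columns is not None:
        it = project(it, columns)
    return it
```

With both arguments left as {{None}}, {{scan}} hands back the source batches untouched, which is the behavior a benchmark would compare against the always-wrapped pipeline.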
Another potential band-aid fix would be to allow column-level parallelism when
scanning a single file (no thread contention would be incurred in that case),
combined with increasing the batch size.
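The shape of column-level parallelism within a single file can be sketched as below. This is a toy under stated assumptions, not the Parquet reader: {{decode_column}} and {{read_file_columns}} are hypothetical names, the "decoding" is a placeholder, and pure-Python threads only show the structure (one task per column) rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def decode_column(raw: bytes) -> List[int]:
    """Hypothetical per-column decode step (placeholder for real decoding)."""
    return list(raw)


def read_file_columns(raw_columns: Dict[str, bytes],
                      max_workers: int = 4) -> Dict[str, List[int]]:
    """Decode every column of a single file concurrently, one task per column."""
    names = list(raw_columns)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `names`.
        decoded = pool.map(decode_column, (raw_columns[name] for name in names))
        return dict(zip(names, decoded))
```

Because only one file is being scanned, the column tasks are the only work submitted to the pool, which is why no contention with file-level parallelism would be expected in this case.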
> [Python] Performance regression reading individual Parquet files using
> Dataset interface
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)