[
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194834#comment-17194834
]
Wes McKinney commented on ARROW-9924:
-------------------------------------
I took a look into this since I was curious what's wrong.
For one thing, the chunking of the result differs dramatically between the two paths:
{code}
In [10]: a = pq.read_table('test.parquet', use_legacy_dataset=True)
In [11]: b = pq.read_table('test.parquet', use_legacy_dataset=False)
In [12]: a[0].num_chunks
Out[12]: 1
In [13]: b[0].num_chunks
Out[13]: 306
{code}
Looking at the top of the hierarchical perf report for the "new" code, the
deeply nested layers of iterators strike me as something to reconsider -- is
that the design we want?
https://gist.github.com/wesm/3e3eeb6b7f5f22650f18e69e206c2eb8
I think the Datasets API may need to make a wiser decision about how to read a
file based on the declared intent of the user. If the user calls {{ToTable}},
then I don't think it makes sense to break the problem up into so many small
tasks -- perhaps the default chunk size should be larger than it is (so that
streaming readers who are concerned about memory use can shrink the chunk size
to something smaller)?
Another question: why are ProjectRecordBatch and FilterRecordBatch being used?
Nothing is being projected or filtered.
> [Python] Performance regression reading individual Parquet files using
> Dataset interface
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)