[
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192188#comment-17192188
]
Joris Van den Bossche commented on ARROW-9924:
----------------------------------------------
There was one other issue about a performance regression (ARROW-9827), for
which I have an open PR (a fix to not parse statistics when no filter is
specified). I compared a release build of that branch against master, and
it doesn't seem to make a difference for this case.
bq. IMHO we should not continue to use the Dataset interface for reading single
files by default until the perf regression has been eliminated.
That came up before, and we can certainly still use the old ParquetFile reader
if, e.g., no {{filter}} is specified (we shouldn't use ParquetDataset for
this case, though, as was done before 1.0).
---
I did a quick profile (with py-spy), and it _seems_ that the dataset version
has a bit more overhead in all kinds of iteration (it uses the
RecordBatchReader rather than {{FileReader::ReadTable}}, which is specifically
meant for reading the whole Parquet file at once).
> [Python] Performance regression reading individual Parquet files using
> Dataset interface
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}