[
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194853#comment-17194853
]
Wes McKinney commented on ARROW-9924:
-------------------------------------
I think I found the problem. I expanded the chunk size to 10M so there is a
single chunk in both cases and:
{code}
In [1]: %time a = pq.read_table('test.parquet', use_legacy_dataset=False)
CPU times: user 1.5 s, sys: 2.59 s, total: 4.08 s
Wall time: 4.09 s

In [2]: %time a = pq.read_table('test.parquet', use_legacy_dataset=True)
CPU times: user 3.49 s, sys: 5.28 s, total: 8.77 s
Wall time: 1.64 s
{code}
Digging deeper, another problem is that column decoding is not being
parallelized when using the Datasets API, whereas it is when you use
{{FileReader::ReadTable}}. This is likely an artifact of the fact that we have
not yet tackled the nested parallelism problem in the Datasets API. It's too
bad that our users are now suffering the consequences of this.
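The column-level parallelism gap can be sketched with a stdlib-only toy (the {{decode_column}} function here is a hypothetical stand-in for per-column Parquet decoding, not Arrow's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_column(raw):
    # Hypothetical stand-in for the per-column decoding work.
    return [v * 2 for v in raw]

def decode_row_group_serial(columns):
    # What the Datasets path effectively does today: one column at a time.
    return [decode_column(c) for c in columns]

def decode_row_group_parallel(columns, max_workers=4):
    # What FileReader::ReadTable does: decode the columns concurrently.
    # pool.map preserves input order, so the output layout is identical.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode_column, columns))

columns = [[1, 2, 3], [4, 5, 6]]
assert decode_row_group_serial(columns) == decode_row_group_parallel(columns)
```

Both variants produce the same table; only the wall time differs once the per-column work is non-trivial, which matches the CPU-vs-wall-time gap in the timings above.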
So there are two problems here:
* 32K is too small a default batch size for quickly reading files into
memory. I suggest setting it to ~256K or ~1M rows per batch
* Parquet row group deserialization is not being parallelized at the column
level in {{parquet::arrow::FileReader::GetRecordBatchReader}}
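To make the first point concrete, per-batch fixed overhead scales with the number of batches; a quick stdlib-only calculation for the ~10M-row file used in this issue:

```python
# Rough batch-count arithmetic for the ~10M-row test file in this issue.
ROWS = 10_000_000

def n_batches(batch_size):
    # Ceiling division: how many record batches the reader must produce.
    return -(-ROWS // batch_size)

for batch_size in (32 * 1024, 256 * 1024, 1_000_000):
    print(f"{batch_size:>9} rows/batch -> {n_batches(batch_size)} batches")
# 32K rows/batch means ~306 batches of per-batch overhead;
# 1M rows/batch means only 10.
```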
The band-aid solution will be to have {{parquet.read_table}} use the old code
path when no special Datasets features are needed, but these two issues do
need to be fixed.
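The band-aid dispatch could look roughly like this sketch (function names are illustrative stand-ins, not pyarrow's real internals):

```python
# Hypothetical dispatch sketch: route to the fast legacy reader unless
# Datasets-only features are requested. read_legacy / read_with_datasets
# are stubs standing in for the two code paths.

def read_legacy(path, columns):
    return ("legacy", path, columns)

def read_with_datasets(path, columns, filters, partitioning):
    return ("datasets", path, columns)

def read_table(path, columns=None, filters=None, partitioning=None):
    # Only the Datasets API supports filters and partitioning; plain
    # single-file reads can take the faster legacy code path.
    if filters is not None or partitioning is not None:
        return read_with_datasets(path, columns, filters, partitioning)
    return read_legacy(path, columns)

assert read_table("test.parquet")[0] == "legacy"
assert read_table("test.parquet", filters=[("A", ">", 0)])[0] == "datasets"
```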
> [Python] Performance regression reading individual Parquet files using
> Dataset interface
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}