[
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194853#comment-17194853
]
Wes McKinney commented on ARROW-9924:
-------------------------------------
I think I found the problem. I expanded the chunk size to 10M so there is a
single chunk in both cases and:
{code}
In [1]: %time a = pq.read_table('test.parquet', use_legacy_dataset=False)
CPU times: user 1.5 s, sys: 2.59 s, total: 4.08 s
Wall time: 4.09 s

In [2]: %time a = pq.read_table('test.parquet', use_legacy_dataset=True)
CPU times: user 3.49 s, sys: 5.28 s, total: 8.77 s
Wall time: 1.64 s
{code}
Digging deeper, another problem is that column decoding is not being
parallelized when using the Datasets API, whereas it is when you use
{{FileReader::ReadTable}}. This is likely an artifact of the fact that we have
not yet tackled the nested parallelism problem in the Datasets API. It's too
bad that our users are now suffering the consequences of this.
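The column-level parallelism gap can be sketched with a stdlib-only toy (the {{decode_column}} function here is a hypothetical stand-in for per-column Parquet decoding, not Arrow's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_column(raw):
    # Hypothetical stand-in for the per-column decoding work.
    return [v * 2 for v in raw]

def decode_row_group_serial(columns):
    # What the Datasets path effectively does today: one column at a time.
    return [decode_column(c) for c in columns]

def decode_row_group_parallel(columns, max_workers=4):
    # What FileReader::ReadTable does: decode the columns concurrently.
    # pool.map preserves input order, so the output layout is identical.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode_column, columns))

columns = [[1, 2, 3], [4, 5, 6]]
assert decode_row_group_serial(columns) == decode_row_group_parallel(columns)
```

Both variants produce the same table; only the wall time differs once the per-column work is non-trivial, which matches the CPU-vs-wall-time gap in the timings above.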
So there are two problems here:
* 32K is too small a default batch size for quickly reading files into
memory. I suggest setting it to ~256K or ~1M rows per batch
* Parquet row group deserialization is not being parallelized at the column
level in {{parquet::arrow::FileReader::GetRecordBatchReader}}
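To make the first point concrete, per-batch fixed overhead scales with the number of batches; a quick stdlib-only calculation for the ~10M-row file used in this issue:

```python
# Rough batch-count arithmetic for the ~10M-row test file in this issue.
ROWS = 10_000_000

def n_batches(batch_size):
    # Ceiling division: how many record batches the reader must produce.
    return -(-ROWS // batch_size)

for batch_size in (32 * 1024, 256 * 1024, 1_000_000):
    print(f"{batch_size:>9} rows/batch -> {n_batches(batch_size)} batches")
# 32K rows/batch means ~306 batches of per-batch overhead;
# 1M rows/batch means only 10.
```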
The band-aid solution will be to have {{parquet.read_table}} use the old code
path when no special Datasets features are needed, but these two issues do
need to be fixed.
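The band-aid dispatch could look roughly like this sketch (function names are illustrative stand-ins, not pyarrow's real internals):

```python
# Hypothetical dispatch sketch: route to the fast legacy reader unless
# Datasets-only features are requested. read_legacy / read_with_datasets
# are stubs standing in for the two code paths.

def read_legacy(path, columns):
    return ("legacy", path, columns)

def read_with_datasets(path, columns, filters, partitioning):
    return ("datasets", path, columns)

def read_table(path, columns=None, filters=None, partitioning=None):
    # Only the Datasets API supports filters and partitioning; plain
    # single-file reads can take the faster legacy code path.
    if filters is not None or partitioning is not None:
        return read_with_datasets(path, columns, filters, partitioning)
    return read_legacy(path, columns)

assert read_table("test.parquet")[0] == "legacy"
assert read_table("test.parquet", filters=[("A", ">", 0)])[0] == "datasets"
```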
> [Python] Performance regression reading individual Parquet files using
> Dataset interface
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-9924
> URL: https://issues.apache.org/jira/browse/ARROW-9924
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Wes McKinney
> Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}