[ https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194834#comment-17194834 ]

Wes McKinney commented on ARROW-9924:
-------------------------------------

I took a look into this since I was curious what's wrong.

One thing I'm not sure about is this difference in chunking:

{code}
In [10]: a = pq.read_table('test.parquet', use_legacy_dataset=True)

In [11]: b = pq.read_table('test.parquet', use_legacy_dataset=False)

In [12]: a[0].num_chunks
Out[12]: 1

In [13]: b[0].num_chunks
Out[13]: 306
{code}

Looking at the top of the hierarchical perf report for the "new" code, the 
deeply nested layers of iterators strike me as one thing to think more about: 
is that the design we want?

https://gist.github.com/wesm/3e3eeb6b7f5f22650f18e69e206c2eb8

I think the Datasets API may need to make a wiser decision about how to read a 
file based on the declared intent of the user. If the user calls {{ToTable}}, 
then I don't think it makes sense to break the problem up into so many small 
tasks. Perhaps the default chunk size should be larger than it is, so that 
streaming readers who are concerned about memory use can shrink it to 
something smaller?

Another question: why are ProjectRecordBatch and FilterRecordBatch being used? 
Nothing is being projected or filtered. 

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-9924
>                 URL: https://issues.apache.org/jira/browse/ARROW-9924
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)