[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

Ben Kietzman (Jira) Mon, 14 Sep 2020 10:24:19 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195633#comment-17195633
 ]


Ben Kietzman commented on ARROW-9924:
-------------------------------------

{quote}
Looking at the top of the hierarchical perf report for the "new" code, the 
deeply nested layers of iterators strikes me as one thing to think more about 
whether that's the design we want
{quote}

To be clear, is the concern over clarity or performance? IIUC 
[https://gist.github.com/wesm/3e3eeb6b7f5f22650f18e69e206c2eb8#file-gistfile1-txt-L8-L20]
 represents minimal cost, since only 0.65% of runtime was spent managing the 
Iterator abstraction. If we wanted to replace our abstraction for lazy 
sequences, we could potentially refactor to a {{Future<T>}}-based iteration. 
Did you have a replacement in mind?

{quote}
why ProjectRecordBatch and FilterRecordBatch being used? Nothing is being 
projected nor filtered
{quote}

We don't explicitly elide them when the projection or filter is trivial. I 
could try to benchmark whether there is a significant performance benefit to 
adding a special case for trivial projection/filtering, but I'd guess we don't 
gain anything.
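
For illustration only, here is a minimal Python sketch of what "eliding" the trivial case could mean in a lazy pipeline. It is not Arrow's C++ implementation: {{scan}}, {{filter_batch}}, {{project_batch}} and the dict-based batch are all hypothetical stand-ins, and it says nothing about whether the special case would actually pay off.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

# Hypothetical stand-in for a record batch: column name -> list of values.
Batch = Dict[str, list]


def filter_batch(batch: Batch, predicate: Callable[[Batch], List[bool]]) -> Batch:
    """Keep only the rows for which the predicate's boolean mask is True."""
    mask = predicate(batch)
    return {
        name: [value for value, keep in zip(column, mask) if keep]
        for name, column in batch.items()
    }


def project_batch(batch: Batch, columns: List[str]) -> Batch:
    """Keep only the requested columns."""
    return {name: batch[name] for name in columns}


def scan(
    batches: Iterable[Batch],
    columns: Optional[List[str]] = None,
    predicate: Optional[Callable[[Batch], List[bool]]] = None,
) -> Iterator[Batch]:
    """Lazily yield batches, adding a stage only when it is non-trivial."""
    it = iter(batches)
    if predicate is not None:   # trivial filter: no extra iterator layer
        it = (filter_batch(b, predicate) for b in it)
    if columns is not None:     # trivial projection: no extra iterator layer
        it = (project_batch(b, columns) for b in it)
    return it
```

With neither {{columns}} nor {{predicate}} given, the batches pass through untouched, so the open question is only how much the extra stages cost when they have nothing to do.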

Another potential band-aid fix would be to allow column-level parallelism when 
scanning a single file (no thread contention would be incurred in that case), 
combined with increasing the batch size.
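
As a rough sketch of the column-level parallelism idea, in plain Python rather than the actual scanner code: {{decode_column}} and {{read_single_file}} are hypothetical names, and the "decoding" is a placeholder for real column chunk decoding.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def decode_column(name: str, raw: List[int]) -> Tuple[str, List[int]]:
    # Hypothetical placeholder for decoding one column chunk of a file.
    return name, [value * 2 for value in raw]


def read_single_file(
    raw_columns: Dict[str, List[int]], use_threads: bool = True
) -> Dict[str, List[int]]:
    """Decode every column of one file, optionally one task per column."""
    if not use_threads:
        return dict(decode_column(n, r) for n, r in raw_columns.items())
    # Only one file is being scanned, so the column tasks do not compete
    # with file-level tasks for the thread pool.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(decode_column, n, r) for n, r in raw_columns.items()
        ]
        return dict(f.result() for f in futures)
```

Both paths return the same columns; only the scheduling differs.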

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-9924
>                 URL: https://issues.apache.org/jira/browse/ARROW-9924
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
