[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

Joris Van den Bossche (Jira) Tue, 08 Sep 2020 05:48:10 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17192188#comment-17192188
 ]


Joris Van den Bossche commented on ARROW-9924:
----------------------------------------------

There was one other issue about a performance regression (ARROW-9827), for 
which I have an open PR (a fix to skip parsing statistics when no filter is 
specified). I compared a release build of that branch against master, and 
it doesn't seem to make a difference for this case.

bq. IMHO we should not continue to use the Dataset interface for reading single 
files by default until the perf regression has been eliminated. 

That came up before, and we can certainly still use the old ParquetFile reader 
if, e.g., no {{filter}} is specified (we shouldn't use ParquetDataset for 
this case, though, as was done before 1.0).
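
A minimal sketch of what such a fallback could look like from the Python side, 
not the actual implementation. The helper names {{pick_reader}} and 
{{read_single_file}} are hypothetical, and {{filters}} is assumed to be a 
{{pyarrow.dataset}} expression such as {{ds.field("A") > 0}}:

```python
def pick_reader(filters=None):
    """Decide which reader to use for a single Parquet file.

    Hypothetical helper: with no filter, take the plain ParquetFile
    path; otherwise go through the Dataset interface.
    """
    return "parquet_file" if filters is None else "dataset"


def read_single_file(path, columns=None, filters=None):
    """Read one Parquet file into a pyarrow Table (hypothetical helper)."""
    # pyarrow is imported lazily so that pick_reader stays usable on its own.
    if pick_reader(filters) == "parquet_file":
        import pyarrow.parquet as pq

        # No filter given: read the whole file directly.
        return pq.ParquetFile(path).read(columns=columns)

    import pyarrow.dataset as ds

    # A filter was given: the Dataset interface can apply it while scanning.
    # `filters` is assumed to be a pyarrow.dataset Expression here.
    dataset = ds.dataset(path, format="parquet")
    return dataset.to_table(columns=columns, filter=filters)
```

With this shape, {{read_single_file('test.parquet')}} would take the 
ParquetFile path and only a call that passes {{filters}} would go through 
the Dataset interface.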

---

I did a quick profile (with py-spy), and it _seems_ that the dataset version 
has a bit more overhead across the various iteration steps (it uses the 
RecordBatchReader, and not {{FileReader::ReadTable}}, which is specifically 
meant to read the whole Parquet file at once).
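
To compare the two read paths directly from Python, something along these 
lines could be used. This is only a sketch: {{best_of}} and 
{{compare_read_paths}} are hypothetical helpers, and it assumes that 
{{ParquetFile.read()}} corresponds to the whole-file read and that 
{{Dataset.to_table()}} goes through the batch-wise scan.

```python
import time


def best_of(fn, repeat=5):
    """Return the fastest wall-clock time (in seconds) over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_read_paths(path, repeat=5):
    """Time both ways of reading a single Parquet file (hypothetical helper)."""
    # Imported lazily so that best_of can be used without pyarrow installed.
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    return {
        # Whole-file read through the ParquetFile reader.
        "ParquetFile.read": best_of(
            lambda: pq.ParquetFile(path).read(), repeat
        ),
        # Read through the Dataset interface.
        "Dataset.to_table": best_of(
            lambda: ds.dataset(path, format="parquet").to_table(), repeat
        ),
    }
```

Calling {{compare_read_paths('test.parquet')}} on the file from the issue 
description returns the best time in seconds for each path, which can be set 
next to the py-spy profile.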

> [Python] Performance regression reading individual Parquet files using 
> Dataset interface
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-9924
>                 URL: https://issues.apache.org/jira/browse/ARROW-9924
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> I haven't investigated very deeply but this seems symptomatic of a problem:
> {code}
> In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})
>
> In [28]: pq.write_table(pa.table(df), 'test.parquet')
>
> In [29]: timeit pq.read_table('test.parquet')
> 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
> 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}



