[ 
https://issues.apache.org/jira/browse/ARROW-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-9730.
----------------------------------------
    Resolution: Not A Problem

> [C++][Dataset] Parsing statistics of Parquet FileMetadata is expensive
> ----------------------------------------------------------------------
>
>                 Key: ARROW-9730
>                 URL: https://issues.apache.org/jira/browse/ARROW-9730
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> From a discussion in dask 
> (https://github.com/dask/dask/pull/6346/#issuecomment-656548675), we noticed 
> that parsing all the statistics of a larger dataset is quite time-consuming.
> It might be that this is already optimized and one simply needs to live 
> with the cost of parsing statistics to get the benefit of those 
> statistics for row group filtering. Still, it might be worth profiling this to 
> make sure there is no performance bug or low-hanging fruit lying 
> around.
> *Example timing:*
> I was testing locally with part of the NYC taxi data (2.5 years, from 
> 2016-07 through the end of 2018, one file per month, 4.3 GB on disk in total):
> {code:python}
> >>> import pyarrow.dataset as ds
> >>> dataset = ds.dataset(
> ...     "notebooks-arrow/nyc-taxi-data/original-partitioned/",
> ...     format="parquet", partitioning=["year", "month"])
> >>> fragments = list(dataset.get_fragments()) 
> >>> len(fragments) 
> 30
> >>> %time [frag.ensure_complete_metadata() for frag in fragments] 
> {code}
> Timing of the last line on master versus master with statistics parsing 
> commented out when collecting the metadata:
> {code:python}
> In [5]: %time [frag.ensure_complete_metadata() for frag in fragments] 
> # master
> CPU times: user 4.22 s, sys: 75.4 ms, total: 4.3 s
> Wall time: 4.41 s
> # master but with parsing statistics commented out (still reading the 
> # FileMetadata and row group information for num_rows, total_byte_size)
> CPU times: user 377 ms, sys: 4.47 ms, total: 381 ms
> Wall time: 404 ms
> {code}
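
To put the two timings above in proportion, here is a small back-of-the-envelope sketch. It uses only the wall times and the fragment count quoted in the issue. The helper name `summarize` is made up for illustration, and it assumes the whole difference between the two runs is attributable to statistics parsing and that these single `%time` runs are representative.

```python
# Wall times reported in the issue, in seconds.
WALL_WITH_STATS = 4.41      # master
WALL_WITHOUT_STATS = 0.404  # master with statistics parsing commented out
NUM_FRAGMENTS = 30          # len(fragments) in the issue


def summarize(with_stats, without_stats, n_fragments):
    """Hypothetical helper, not part of pyarrow.

    Returns a tuple of:
      - slowdown factor of the run with statistics vs. the run without,
      - fraction of the total time attributed to statistics parsing,
      - seconds of statistics parsing per fragment.
    Assumes the full difference between the runs is statistics parsing.
    """
    stats_cost = with_stats - without_stats
    return (
        with_stats / without_stats,
        stats_cost / with_stats,
        stats_cost / n_fragments,
    )


slowdown, share, per_fragment = summarize(
    WALL_WITH_STATS, WALL_WITHOUT_STATS, NUM_FRAGMENTS
)
print(f"slowdown with statistics: {slowdown:.1f}x")
print(f"share of time in statistics parsing: {share:.0%}")
print(f"statistics parsing per fragment: {per_fragment * 1000:.0f} ms")
```

Under those assumptions, parsing statistics makes the call roughly 11x slower, accounts for about 91% of the wall time, and costs on the order of 134 ms per monthly file.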



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
