[
https://issues.apache.org/jira/browse/ARROW-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche closed ARROW-9730.
----------------------------------------
Resolution: Not A Problem
> [C++][Dataset] Parsing statistics of Parquet FileMetadata is expensive
> ----------------------------------------------------------------------
>
> Key: ARROW-9730
> URL: https://issues.apache.org/jira/browse/ARROW-9730
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>
> From a discussion in dask
> (https://github.com/dask/dask/pull/6346/#issuecomment-656548675), we noticed
> that parsing all the statistics of a larger dataset is quite time consuming.
> Now, it might be that this is already optimized and one simply needs to live
> with the cost of parsing statistics if you want the benefit of those
> statistics for row group filtering. But, it might be worth profiling this to
> ensure there is not actually some performance bug / low hanging fruit lying
> around.
> *Example timing:*
> I was testing locally with a part of the NYC taxi data (for 2.5 years
> (2016-07 - end 2018), one file per month, total disk size of 4.3 GB):
> {code:python}
> >>> import pyarrow.dataset as ds
> >>> dataset =
> >>> ds.dataset("notebooks-arrow/nyc-taxi-data/original-partitioned/",
> >>> format="parquet", partitioning=["year", "month"])
> >>> fragments = list(dataset.get_fragments())
> >>> len(fragments)
> 30
> >>> %time [frag.ensure_complete_metadata() for frag in fragments]
> {code}
> Timing results of the last line of master vs commenting out parsing
> statistics when collecting the metadata:
> {code:python}
> In [5]: %time [frag.ensure_complete_metadata() for frag in fragments]
> # master
> CPU times: user 4.22 s, sys: 75.4 ms, total: 4.3 s
> Wall time: 4.41 s
> # master but with parsing statistics commented out (still reading the
> FileMetadata and row group information for num_rows, total_byte_size)
> CPU times: user 377 ms, sys: 4.47 ms, total: 381 ms
> Wall time: 404 ms
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)