[jira] [Updated] (ARROW-9730) [C++][Dataset] Parsing statistics of Parquet FileMetadata is expensive

Joris Van den Bossche (Jira) Thu, 13 Aug 2020 11:44:56 -0700


     [ https://issues.apache.org/jira/browse/ARROW-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joris Van den Bossche updated ARROW-9730:
-----------------------------------------
    Description: 
From a discussion in dask
(https://github.com/dask/dask/pull/6346/#issuecomment-656548675), we noticed
that parsing all the statistics of a larger dataset is quite time-consuming.

This may already be optimized, in which case the cost of parsing statistics is
simply the price of using those statistics for row group filtering. Still, it
is worth profiling to make sure there is no performance bug or low-hanging
fruit lying around.

*Example timing:*

I tested locally with part of the NYC taxi data (2.5 years, from 2016-07
through the end of 2018, one file per month, 4.3 GB on disk in total):

{code:python}
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("notebooks-arrow/nyc-taxi-data/original-partitioned/",
...                      format="parquet", partitioning=["year", "month"])
>>> fragments = list(dataset.get_fragments())
>>> len(fragments)
30
>>> %time [frag.ensure_complete_metadata() for frag in fragments]
{code}

Timing results for the last line, on master vs. with statistics parsing
commented out when collecting the metadata:

{code:python}
In [5]: %time [frag.ensure_complete_metadata() for frag in fragments] 
# master
CPU times: user 4.22 s, sys: 75.4 ms, total: 4.3 s
Wall time: 4.41 s

# master but with parsing statistics commented out (still reading the
# FileMetadata and row group information for num_rows, total_byte_size)
CPU times: user 377 ms, sys: 4.47 ms, total: 381 ms
Wall time: 404 ms

{code}
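
To put the two measurements in proportion, here is a small sketch that only does arithmetic on the wall times reported above. The function name and the dictionary keys are made up for illustration; nothing here calls into Arrow. With these numbers, master is roughly 11x slower than the run without statistics parsing, which would attribute about 91% of the wall time (on the order of 134 ms per fragment) to the statistics.

```python
# Wall times copied from the timing output above (30 fragments).
N_FRAGMENTS = 30
WALL_WITH_STATS_S = 4.41      # master
WALL_WITHOUT_STATS_S = 0.404  # statistics parsing commented out


def summarize(with_stats_s, without_stats_s, n_fragments):
    """Hypothetical helper: derive relative costs from two wall times."""
    # Time attributed to statistics parsing, assuming the two runs differ
    # only in whether the statistics are parsed.
    stats_cost_s = with_stats_s - without_stats_s
    return {
        "slowdown_factor": with_stats_s / without_stats_s,
        "stats_share": stats_cost_s / with_stats_s,
        "stats_cost_per_fragment_ms": 1000 * stats_cost_s / n_fragments,
    }


summary = summarize(WALL_WITH_STATS_S, WALL_WITHOUT_STATS_S, N_FRAGMENTS)
for name, value in summary.items():
    print(f"{name}: {value:.3g}")
```

These are single-run timings, so the ratios are indicative only; a profile would still be needed to see where inside the statistics parsing the time goes.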

  was:
From a discussion in dask
(https://github.com/dask/dask/pull/6346/#issuecomment-656548675), we noticed
that parsing all the statistics of a larger dataset is quite time consuming.

Now, it might be that this is already optimized and one simply needs to live 
with the cost of parsing statistics if you want the benefit of those statistics 
for row group filtering. But, it might be worth profiling this to ensure there 
is not actually some performance bug / low hanging fruit lying around.

*Example timing:*

I was testing locally with a part of the NYC taxi data (for 2.5 years (2016-07 
- end 2018), one file per month, total disk size of 4.3 GB):

{code:python}
>>> import pyarrow.dataset as ds
>>> dataset = ds.dataset("notebooks-arrow/nyc-taxi-data/original-partitioned/",
...                      format="parquet", partitioning=["year", "month"])
>>> fragments = list(dataset.get_fragments())
>>> len(fragments)
30
>>> %time [frag.ensure_complete_metadata() for frag in fragments]
{code}

Timing results of the last line of master vs commenting out parsing statistics 
/ num_rows when collecting the metadata:

{code:python}
In [5]: %time [frag.ensure_complete_metadata() for frag in fragments] 
# master
CPU times: user 4.22 s, sys: 75.4 ms, total: 4.3 s
Wall time: 4.41 s

# master but with parsing statistics commented out (still reading the
# FileMetadata and row group information for num_rows, total_byte_size)
CPU times: user 377 ms, sys: 4.47 ms, total: 381 ms
Wall time: 404 ms

{code}


> [C++][Dataset] Parsing statistics of Parquet FileMetadata is expensive
> ----------------------------------------------------------------------
>
>                 Key: ARROW-9730
>                 URL: https://issues.apache.org/jira/browse/ARROW-9730
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset



