[ https://issues.apache.org/jira/browse/ARROW-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-10130:
---------------------------------------------

    Assignee: Joris Van den Bossche

> [C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve 
> "complete_metadata" status
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10130
>                 URL: https://issues.apache.org/jira/browse/ARROW-10130
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>             Fix For: 2.0.0
>
>
> Splitting a ParquetFileFragment into multiple fragments, one per row group 
> ({{SplitByRowGroup}}), calls {{EnsureCompleteMetadata}} first, but the 
> resulting fragments do not have the {{has_complete_metadata_}} property set. 
> As a result, calling {{EnsureCompleteMetadata}} on the split fragments 
> reads and parses the metadata again instead of using the cached metadata 
> (which is already present).
> Small example to illustrate:
> {code:python}
> In [1]: import pyarrow.dataset as ds
> In [2]: dataset = ds.parquet_dataset("nyc-taxi-data/dask-partitioned/_metadata", partitioning="hive")
> In [3]: rg_fragments = [rg for frag in dataset.get_fragments() for rg in frag.split_by_row_group()]
> In [4]: len(rg_fragments)
> Out[4]: 4520
> # row group fragments actually have statistics
> In [7]: rg_fragments[0].row_groups[0].statistics
> Out[7]: 
> {'vendor_id': {'min': '1', 'max': '4'},
>  'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 5, 51),
>   'max': datetime.datetime(2018, 12, 26, 14, 48, 54)},
> ...
> # but calling ensure_complete_metadata still takes a lot of time on the first call
> In [8]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
> CPU times: user 1.72 s, sys: 203 ms, total: 1.92 s
> Wall time: 1.9 s
> In [9]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
> CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms
> Wall time: 1.35 ms
> {code}
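
The caching behaviour described in the issue can be sketched in plain Python, without pyarrow. This is only a toy model: `ToyFragment`, `parse_count`, and the `propagate_flag` parameter are hypothetical names made up for illustration and are not part of the Arrow API. It assumes that one way to avoid the repeated parse would be to carry the "complete metadata" flag over to the fragments created by the split; the actual fix in Arrow may look different.

```python
class ToyFragment:
    """Hypothetical stand-in for a Parquet file fragment (not the Arrow class)."""

    def __init__(self, row_groups, metadata=None, has_complete_metadata=False):
        self.row_groups = list(row_groups)
        self.metadata = metadata
        self.has_complete_metadata = has_complete_metadata
        self.parse_count = 0  # counts how often the "expensive" parse ran

    def ensure_complete_metadata(self):
        # Cached path: nothing to do if the flag is already set.
        if self.has_complete_metadata:
            return
        # Simulated expensive read/parse of the file metadata.
        self.parse_count += 1
        self.metadata = {rg: f"stats-for-row-group-{rg}" for rg in self.row_groups}
        self.has_complete_metadata = True

    def split_by_row_group(self, propagate_flag):
        # The parent makes sure its own metadata is complete first.
        self.ensure_complete_metadata()
        # Each child receives the already-parsed metadata for its row group.
        # With propagate_flag=False the flag is left unset, mimicking the
        # behaviour reported in this issue.
        return [
            ToyFragment(
                [rg],
                metadata={rg: self.metadata[rg]},
                has_complete_metadata=propagate_flag,
            )
            for rg in self.row_groups
        ]


parent = ToyFragment(range(3))
without_flag = parent.split_by_row_group(propagate_flag=False)
with_flag = parent.split_by_row_group(propagate_flag=True)

for frag in without_flag + with_flag:
    frag.ensure_complete_metadata()

# Number of redundant parses performed by the children in each variant.
parses_without_flag = sum(f.parse_count for f in without_flag)
parses_with_flag = sum(f.parse_count for f in with_flag)
print(parses_without_flag, parses_with_flag)
```

In this toy model, every child created without the flag parses once more even though it already holds its metadata, while the children created with the flag skip the parse entirely.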



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
