Joris Van den Bossche created ARROW-9321:

             Summary: [C++][Dataset] Allow to "collect" statistics for 
ParquetFragment row groups if not constructed from _metadata
                 Key: ARROW-9321
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche

Right now, the statistics of the {{RowGroupInfo}} of ParquetFileFragments are 
only available when the dataset was constructed from a {{_metadata}} file:

import pandas as pd
df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)})              
# use dask to write partitioned dataset *with* _metadata file
import dask.dataframe as dd                                                     
ddf = dd.from_pandas(df, npartitions=2) 
ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow")         

import pyarrow.dataset as ds
dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", 
dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", 


In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups                

In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups              
Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>]

In [32]: 
Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}}

For some applications (eg dask), one could want access to those statistics, 
even if the original dataset / fragments were not created from a {{_metadata}} 
file. This should not happen automatically since it's costly, but a method to 
trigger collecting all metadata would be useful.

cc [~rjzamora] 

This message was sent by Atlassian Jira

Reply via email to