Joris Van den Bossche created ARROW-9321:
--------------------------------------------

             Summary: [C++][Dataset] Allow to "collect" statistics for 
ParquetFragment row groups if not constructed from _metadata
                 Key: ARROW-9321
                 URL: https://issues.apache.org/jira/browse/ARROW-9321
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


Right now, the statistics of the {{RowGroupInfo}} of ParquetFileFragments are 
only available when the dataset was constructed from a {{_metadata}} file:

{code:python}
import pandas as pd
df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)})

# use dask to write partitioned dataset *with* _metadata file
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow")

import pyarrow.dataset as ds
dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", partitioning="hive")
dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", partitioning="hive")
{code}

{code}

In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups

In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups
Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>]

In [32]: list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics
Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}}
{code}

For some applications (e.g. dask), it would be useful to have access to those 
statistics even if the original dataset / fragments were not created from a 
{{_metadata}} file. This should not happen automatically, since it is costly, 
but a method to explicitly trigger collecting all metadata would be useful.

cc [~rjzamora] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
