Joris Van den Bossche created ARROW-9321: --------------------------------------------
Summary: [C++][Dataset] Allow to "collect" statistics for ParquetFragment row groups if not constructed from _metadata Key: ARROW-9321 URL: https://issues.apache.org/jira/browse/ARROW-9321 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Right now, the statistics of the {{RowGroupInfo}} of ParquetFileFragments are only available when the dataset was constructed from a {{_metadata}} file: {code:python} import pandas as pd df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)}) # use dask to write partitioned dataset *with* _metadata file import dask.dataframe as dd ddf = dd.from_pandas(df, npartitions=2) ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow") import pyarrow.dataset as ds dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", partitioning="hive") dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", partitioning="hive") {code} {code} In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>] In [32]: list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}} {code} For some applications (eg dask), one could want access to those statistics, even if the original dataset / fragments were not created from a {{_metadata}} file. This should not happen automatically since it's costly, but a method to trigger collecting all metadata would be useful. cc [~rjzamora] -- This message was sent by Atlassian Jira (v8.3.4#803005)