[jira] [Created] (ARROW-10131) [C++][Dataset] Lazily parse parquet metadata / statistics in ParquetDatasetFactory and ParquetFileFragment

Joris Van den Bossche (Jira) Tue, 29 Sep 2020 05:33:48 -0700

Joris Van den Bossche created ARROW-10131:
---------------------------------------------


             Summary: [C++][Dataset] Lazily parse parquet metadata / statistics 
in ParquetDatasetFactory and ParquetFileFragment
                 Key: ARROW-10131
                 URL: https://issues.apache.org/jira/browse/ARROW-10131
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


Related to ARROW-9730: parsing the statistics in Parquet metadata is 
expensive and should therefore be avoided when possible.

For example, the {{ParquetDatasetFactory}} ({{ds.parquet_dataset()}} in Python) 
parses all statistics of all files and all columns. A filtered read, however, 
might only need the statistics of certain files (e.g. if a filter on a 
partition field already excluded many files) and certain columns (e.g. only the 
columns on which you are actually filtering).
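
To make the potential saving concrete, here is a minimal Python sketch that 
counts which (file, column) statistics a filtered read would actually need. 
The file list, the "year" partition field and the helper 
needed_statistics() are hypothetical and invented for illustration; they are 
not part of the Arrow API.

```python
# Hypothetical dataset: four files partitioned by "year", three data columns.
files = [
    {"path": "year=2018/part-0.parquet", "partition": {"year": 2018}},
    {"path": "year=2019/part-0.parquet", "partition": {"year": 2019}},
    {"path": "year=2020/part-0.parquet", "partition": {"year": 2020}},
    {"path": "year=2020/part-1.parquet", "partition": {"year": 2020}},
]
columns = ["a", "b", "c"]


def needed_statistics(files, columns, partition_filter, filter_columns):
    """Return the (path, column) pairs whose statistics a filtered read needs."""
    # Files rejected by the partition filter never need their statistics.
    surviving = [f for f in files if partition_filter(f["partition"])]
    # Of the surviving files, only the filtered columns need statistics.
    return [
        (f["path"], c) for f in surviving for c in columns if c in filter_columns
    ]


# What an eager factory parses: every column of every file.
all_pairs = [(f["path"], c) for f in files for c in columns]

# What a read filtered on year == 2020 and on column "a" needs.
needed = needed_statistics(files, columns, lambda p: p["year"] == 2020, {"a"})

print(len(all_pairs), len(needed))  # prints "12 2"
```

In this toy case only 2 of the 12 statistics entries are relevant to the 
filter.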

The current API is a bit all-or-nothing: both ParquetDatasetFactory and a later 
EnsureCompleteMetadata parse all statistics, and neither allows parsing only a 
subset of them or parsing only the other (non-statistics) metadata. I think we 
should try to come up with better abstractions.
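
One possible shape for such an abstraction, sketched in Python for brevity: 
keep the statistics in their raw form and parse them per column on first 
access, caching the result. The class LazyFragment, its fields and the 
"min,max" string encoding are assumptions made up for this example; this is 
not how ParquetFileFragment is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LazyFragment:
    """Hypothetical stand-in for a Parquet file fragment (not the Arrow API)."""

    path: str
    # Unparsed statistics per column, encoded here as a "min,max" string.
    raw_statistics: Dict[str, str]
    _parsed: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    parse_count: int = 0  # number of columns whose statistics were parsed

    def statistics(self, column: str) -> Tuple[int, int]:
        """Parse the statistics of one column on first access and cache them."""
        if column not in self._parsed:
            lo, hi = self.raw_statistics[column].split(",")
            self._parsed[column] = (int(lo), int(hi))
            self.parse_count += 1
        return self._parsed[column]

    def may_contain(self, column: str, value: int) -> bool:
        """Check whether `value` lies within the column's [min, max] range."""
        lo, hi = self.statistics(column)
        return lo <= value <= hi


fragment = LazyFragment(
    "part-0.parquet", {"a": "0,10", "b": "5,50", "c": "100,200"}
)
fragment.may_contain("a", 7)   # parses the statistics of column "a" only
fragment.may_contain("a", 42)  # served from the cache, no new parsing
print(fragment.parse_count)    # prints 1
```

Columns "b" and "c" are never parsed here, because the filter never touches 
them.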

cc [~rjzamora] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
