Joris Van den Bossche created ARROW-8062:
--------------------------------------------
Summary: [C++][Dataset] Parquet Dataset factory from a
_metadata/_common_metadata file
Key: ARROW-8062
URL: https://issues.apache.org/jira/browse/ARROW-8062
Project: Apache Arrow
Issue Type: Improvement
Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche
Partitioned parquet datasets sometimes come with {{_metadata}} /
{{_common_metadata}} files. Those files include information about the schema of
the full dataset and potentially all RowGroup metadata as well (for
{{_metadata}}).
Using those files during the creation of a parquet {{Dataset}} can give a more
efficient factory: the stored schema can be used instead of inferring one by
unifying the schemas of all files, and the stored paths to the individual
parquet files can be used instead of crawling the directory.
Basically, based on those files, the schema, the list of paths and the
partition expressions (the information that is needed to create a Dataset)
could be constructed.
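As a rough illustration of the path and partition part of this, here is a
minimal Python sketch. It assumes that the relative file paths have already
been read from the row group metadata in {{_metadata}}, and that the dataset
uses hive-style {{key=value}} directories; neither is stated above. The helper
names {{partition_keys}} and {{manifest_from_paths}} and the sample paths are
hypothetical and are not part of any Arrow API.

```python
from pathlib import PurePosixPath


def partition_keys(relative_path):
    """Parse hive-style key=value directory segments from a relative path.

    Assumption: partitions are encoded as "key=value" directory names.
    """
    keys = {}
    # Only directory components carry partition information; skip the file name.
    for segment in PurePosixPath(relative_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            keys[key] = value
    return keys


def manifest_from_paths(row_group_paths):
    """Collapse per-row-group paths into unique files with their partition keys."""
    manifest = {}
    for path in row_group_paths:
        # A file with several row groups appears several times; keep it once.
        # dicts preserve insertion order, so files stay in first-seen order.
        manifest.setdefault(path, partition_keys(path))
    return manifest


# Hypothetical paths, one entry per row group, as they might be stored
# relative to the dataset root.
row_group_paths = [
    "year=2019/month=12/part-0.parquet",
    "year=2019/month=12/part-0.parquet",
    "year=2020/month=1/part-0.parquet",
]

manifest = manifest_from_paths(row_group_paths)
for path, keys in manifest.items():
    print(path, keys)
```

The partition values stay as strings here; a real factory would still have to
turn them into typed partition expressions, and would take the schema from the
metadata file rather than from the paths.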
Such logic could be put in a different factory class, e.g.
{{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).