[ https://issues.apache.org/jira/browse/ARROW-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Francois Saint-Jacques reassigned ARROW-8062:
---------------------------------------------

    Assignee: Francois Saint-Jacques

> [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-8062
>                 URL: https://issues.apache.org/jira/browse/ARROW-8062
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset, Python
>            Reporter: Joris Van den Bossche
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>
> Partitioned parquet datasets sometimes come with {{_metadata}} /
> {{_common_metadata}} files. Those files include information about the schema
> of the full dataset and potentially all RowGroup metadata as well (for
> {{_metadata}}).
> Using those files during the creation of a parquet {{Dataset}} can give a
> more efficient factory: it can use the stored schema instead of inferring the
> schema by unifying the schemas of all files, and it can use the stored paths
> to the individual parquet files instead of crawling the directory.
> Basically, based on those files, the schema, the list of paths and the
> partition expressions (the information that is needed to create a Dataset)
> could be constructed.
> Such logic could be put in a different factory class, e.g.
> {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)