[jira] [Updated] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

ASF GitHub Bot (Jira) Thu, 14 May 2020 11:36:28 -0700


     [ https://issues.apache.org/jira/browse/ARROW-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


ASF GitHub Bot updated ARROW-8062:
----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-8062
>                 URL: https://issues.apache.org/jira/browse/ARROW-8062
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Partitioned parquet datasets sometimes come with {{_metadata}} / 
> {{_common_metadata}} files. Those files include information about the schema 
> of the full dataset and potentially all RowGroup metadata as well (for 
> {{_metadata}}).
> Using those files when creating a parquet {{Dataset}} can give a more 
> efficient factory: it can use the stored schema instead of inferring one by 
> unifying the schemas of all files, and use the recorded paths to the 
> individual parquet files instead of crawling the directory.
> Basically, based on those files, the schema, the list of paths and the 
> partition expressions (the information that is needed to create a Dataset) 
> could be constructed.
> Such logic could be put in a different factory class, e.g. 
> {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).
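
The path-and-partition part of the idea above can be sketched in plain Python. This is only an illustration, not Arrow code: it assumes hive-style `key=value` directory names and assumes the per-row-group file paths have already been read out of `_metadata`. The function name `manifest_from_row_group_paths` and the sample paths are hypothetical.

```python
def manifest_from_row_group_paths(row_group_paths):
    """Hypothetical helper: map each distinct data file to its partition values.

    `row_group_paths` stands in for the file path recorded with every
    row group in a `_metadata` file (one entry per row group).
    """
    files = {}
    for path in row_group_paths:
        if path in files:
            continue  # several row groups can belong to the same file
        partition = {}
        # Assumption: hive-style directories such as "year=2019/".
        for segment in path.split("/")[:-1]:
            if "=" in segment:
                key, value = segment.split("=", 1)
                partition[key] = value
        files[path] = partition
    return files


# Made-up paths, as they might appear for three row groups.
row_group_paths = [
    "year=2019/part-0.parquet",
    "year=2019/part-0.parquet",
    "year=2020/part-0.parquet",
]
manifest = manifest_from_row_group_paths(row_group_paths)
for path, partition in manifest.items():
    print(path, partition)
```

No directory listing is needed here: both the file list and the partition values come from the recorded paths alone, which is the efficiency gain the description refers to.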



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
