[
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818370#comment-16818370
]
Martin Durant commented on ARROW-1983:
--------------------------------------
> Note that the Parquet format has three different metadata structures
No, this is incorrect; unfortunately, the term "metadata" is used with multiple
meanings here.
- All parquet files contain FileMetaData in the file footer, which may include
one or more key-value pairs, and includes other important things like the schema
- If the file contains any row-groups or references to row-groups in other
files, it will also contain ColumnMetaData (and possibly more key-value pairs);
this is all *within* the FileMetaData structure
- the special file `_metadata` may exist, which contains *only* FileMetaData,
and any row-groups have only links to other files and no data within the file.
- the special file `_common_metadata` may exist, which also only contains a
FileMetaData structure, but has no row group components at all.
- ordinary data files should have the same common metadata (schema,
key-values), so you can load any one of them, but they contain only the
row-groups of that one file and no links to any others.
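
To make the distinction between the cases above concrete, here is a rough sketch in plain Python. It is not the pyarrow or parquet-cpp API: the `RowGroup` and `FileMetaData` classes, their fields, and the `classify` helper are hypothetical stand-ins for the Thrift structures, and the "mixed" label is my own simplification for a footer that has both local and linked row-groups.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RowGroup:
    # Hypothetical stand-in: file_path is None when the column data lives in
    # the same file as the footer, otherwise it names the file holding the data.
    file_path: Optional[str] = None
    num_rows: int = 0


@dataclass
class FileMetaData:
    # Hypothetical stand-in for the footer: schema, key-value pairs, row-groups.
    schema: List[str]
    key_value: Dict[str, str] = field(default_factory=dict)
    row_groups: List[RowGroup] = field(default_factory=list)


def classify(meta: FileMetaData) -> str:
    """Name the kind of file a footer looks like, judged by its contents only."""
    if not meta.row_groups:
        # Schema and key-values only, no row-group components at all.
        return "_common_metadata"
    linked = [rg.file_path is not None for rg in meta.row_groups]
    if all(linked):
        # Every row-group is a link to another file; no data in this file.
        return "_metadata"
    if not any(linked):
        # Only this file's own row-groups, no links to other files.
        return "data file"
    return "mixed"


schema = ["id", "value"]
kv = {"writer": "example"}

common = FileMetaData(schema, kv)
data = FileMetaData(schema, kv, [RowGroup(num_rows=100)])
summary = FileMetaData(
    schema,
    kv,
    [RowGroup("part.0.parquet", 100), RowGroup("part.1.parquet", 50)],
)

kinds = [classify(m) for m in (common, data, summary)]
```

In all three cases there is a single FileMetaData structure; what differs is only whether it carries row-groups and where those row-groups point.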
> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Jim Crist
> Priority: Major
> Labels: beginner, parquet
> Fix For: 0.14.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file
> (mostly just schema information). It would be useful to add the ability to
> write a {{_metadata}} file as well. This should include information about
> each row group in the dataset, including summary statistics. Having this
> summary file would allow filtering of row groups without needing to access
> each file beforehand.
> This would require that the user be able to get the written RowGroups out of
> a {{pyarrow.parquet.write_table}} call and then pass these objects as a list
> to a new function, which hands them on as C++ objects to {{parquet-cpp}} to
> generate the corresponding {{_metadata}} file.