[
https://issues.apache.org/jira/browse/ARROW-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428725#comment-17428725
]
Joris Van den Bossche commented on ARROW-14303:
-----------------------------------------------
The mailing list question that sparked this issue mentioned that "this
duplication can significantly increase the size of the file when there is a
large amount of metadata stored" (though I don't know how common it is to have
metadata large enough for this to become an issue).
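
To get a rough feel for the overhead, here is a back-of-the-envelope sketch in
plain Python. It is only a model, not the Arrow writer: the function name
estimate_footer_metadata_bytes is hypothetical, and it assumes the ARROW:schema
value is a base64-encoded copy of the serialized schema that embeds the same
key/value pairs. Thrift framing and the rest of the schema are ignored.

```python
import base64


def estimate_footer_metadata_bytes(metadata, deduplicate=False):
    """Hypothetical size model for schema metadata in a Parquet footer.

    Assumption: without deduplication, every key/value pair is stored once
    in the Parquet key/value metadata and once more inside the
    base64-encoded ARROW:schema value.
    """
    # Plain copy: the pairs as stored in the Parquet key/value metadata.
    plain = sum(len(key) + len(value) for key, value in metadata.items())
    if deduplicate:
        # Proposed behaviour: the keys are dropped from ARROW:schema.
        return plain
    # Embedded copy: the same bytes again, inflated by base64 (about 4/3).
    raw = b"".join(key + value for key, value in metadata.items())
    embedded = len(base64.b64encode(raw))
    return plain + embedded


# Example: a single 1 MB metadata value.
metadata = {b"notes": b"x" * 1_000_000}
duplicated = estimate_footer_metadata_bytes(metadata)
deduplicated = estimate_footer_metadata_bytes(metadata, deduplicate=True)
print(duplicated, deduplicated, round(duplicated / deduplicated, 2))
```

Under these assumptions the footer carries a bit more than twice the metadata
bytes it needs, so the cost only matters when the metadata itself is large.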
> [C++][Parquet] Do not duplicate Schema metadata in Parquet schema metadata
> and serialized ARROW:schema value
> ------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14303
> URL: https://issues.apache.org/jira/browse/ARROW-14303
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 7.0.0
>
>
> Metadata values are being duplicated in the Parquet file footer. We should
> store them in only one place: either ARROW:schema or the Parquet schema
> metadata. Removing them from the Parquet schema metadata may break
> applications that expect that metadata to be there when serialized from
> Arrow, so dropping the keys from ARROW:schema is probably the safer choice.