[ https://issues.apache.org/jira/browse/PARQUET-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280465#comment-16280465 ]
Nandor Kollar commented on PARQUET-899:
---------------------------------------
[~zi] isn't {{org.apache.parquet.hadoop.metadata.FileMetaData#createdBy}} used
for this purpose? Also, this class has a
{{Map<String, String> keyValueMetaData}} field, where applications can store
arbitrary metadata (such as a version number).
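
To illustrate the key-value idea, here is a minimal sketch that uses only plain Java collections and does not call the Parquet API. The key names {{writer.application.name}} and {{writer.application.version}} and the {{WriterInfo}} helper are hypothetical, made up for this example; Parquet does not define them. A writer would put entries like these into the key-value metadata, and a reader could check them before applying a workaround.

```java
import java.util.HashMap;
import java.util.Map;

final class WriterInfo {
    // Hypothetical key names, chosen for illustration only.
    public static final String APP_NAME_KEY = "writer.application.name";
    public static final String APP_VERSION_KEY = "writer.application.version";

    private WriterInfo() {}

    /** Builds the entries a writing application would add to its key-value metadata. */
    public static Map<String, String> describeWriter(String name, String version) {
        Map<String, String> meta = new HashMap<>();
        meta.put(APP_NAME_KEY, name);
        meta.put(APP_VERSION_KEY, version);
        return meta;
    }

    /** Returns true only if the metadata names exactly this application and version. */
    public static boolean writtenBy(Map<String, String> meta, String name, String version) {
        // Missing keys yield null from get(), so files without these entries never match.
        return name.equals(meta.get(APP_NAME_KEY))
            && version.equals(meta.get(APP_VERSION_KEY));
    }

    public static void main(String[] args) {
        Map<String, String> meta = describeWriter("example-app", "1.2.0");
        System.out.println(writtenBy(meta, "example-app", "1.2.0"));
    }
}
```

Files written without these entries simply do not match, so a reader relying on such keys would have to treat older files as "writer unknown".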
> Add metadata field describing the application that wrote the file
> -----------------------------------------------------------------
>
> Key: PARQUET-899
> URL: https://issues.apache.org/jira/browse/PARQUET-899
> Project: Parquet
> Issue Type: Improvement
> Reporter: Zoltan Ivanfi
>
> Although the Parquet library should behave the same regardless of what
> application uses it, occasionally serious interoperability bugs are
> introduced in specific applications. For example, data written by a specific
> application may be unnecessarily adjusted or the calculated statistics may be
> invalid (both actual problems).
> Unfortunately, it is currently not possible to recognize Parquet files
> affected by application problems because the metadata does not contain any
> information about the application using the Parquet library. (The name and
> version number of the Parquet library are recorded, but that is of limited
> use, because apart from Impala, the most widespread Parquet writers all use
> the same Java library.)
> To allow creating workarounds for future known issues, we should introduce
> new metadata fields that applications can populate. The simplest approach is
> to have one field for the application name and another for its version
> number. A more sophisticated approach suggested by [~julienledem] could also
> reference a list of earlier issues that are known to be fixed in the
> application that wrote the Parquet file.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)