[ https://issues.apache.org/jira/browse/PARQUET-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281989#comment-16281989 ]
Zoltan Ivanfi commented on PARQUET-899:
---------------------------------------
The createdBy metadata stores the name and version number of the Parquet
library, not those of the higher-level component that consumes the Parquet
API. The freeform key-value metadata can store any information, but the keys
used there won't be standardized. I think we should add a new, dedicated field
(or rather two separate ones, identifying the writer and its version
respectively) to the specification in parquet-format and to the language
bindings as well. By the latter I mean API functions that higher-level
components can call to set this information. Alternatively, we could add this
information as a new parameter to an existing function that components already
have to call when they write Parquet files. In that case we could even
deprecate the older versions that do not accept this information.
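
To make the two options above concrete, here is a minimal sketch in Python. It
is not the parquet-format Thrift definition or any existing binding: the
FileMetaData class is a toy stand-in, and the names writer_application,
writer_version, tag_with_key_values, tag_with_dedicated_fields and the
"writer.*" keys are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FileMetaData:
    """Toy stand-in for the Parquet footer metadata (not the real struct)."""
    created_by: str  # name and version of the Parquet library itself
    key_value_metadata: Dict[str, str] = field(default_factory=dict)
    # Hypothetical dedicated fields, as proposed in the comment above.
    writer_application: Optional[str] = None
    writer_version: Optional[str] = None


def tag_with_key_values(meta: FileMetaData, application: str, version: str) -> FileMetaData:
    """Option 1: freeform key-value metadata. Works today, but the key
    names are an ad hoc convention that readers cannot rely on."""
    meta.key_value_metadata["writer.application"] = application
    meta.key_value_metadata["writer.version"] = version
    return meta


def tag_with_dedicated_fields(meta: FileMetaData, application: str, version: str) -> FileMetaData:
    """Option 2: dedicated fields that a binding could expose through an
    API call made by the higher-level component."""
    meta.writer_application = application
    meta.writer_version = version
    return meta


# Illustrative values only; the library string is an example, not a quote
# from any real file.
meta = FileMetaData(created_by="parquet-mr version 1.8.0")
tag_with_dedicated_fields(meta, "example-app", "2.0.5")
```

With dedicated fields, createdBy keeps describing the library while the
application identifies itself separately, so a reader does not need to know
any per-application key naming convention.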
> Add metadata field describing the application that wrote the file
> -----------------------------------------------------------------
>
> Key: PARQUET-899
> URL: https://issues.apache.org/jira/browse/PARQUET-899
> Project: Parquet
> Issue Type: Improvement
> Reporter: Zoltan Ivanfi
>
> Although the Parquet library should behave the same regardless of what
> application uses it, occasionally serious interoperability bugs are
> introduced in specific applications. For example, data written by a specific
> application may be unnecessarily adjusted or the calculated statistics may be
> invalid (both actual problems).
> Unfortunately, currently it is not possible to recognize Parquet files
> affected by application problems because the metadata does not contain any
> information about the application using the Parquet library. (The name and
> version number of the Parquet library are recorded, but that is of limited
> use, because apart from Impala, the most widespread Parquet writers all use
> the same Java library.)
> To allow creating workarounds for future known issues, we should introduce
> new metadata fields that applications can populate. The simplest approach is
> to have one field for the application name and another for its version
> number. A more sophisticated approach suggested by [~julienledem] could also
> reference a list of earlier issues that are known to be fixed in the
> application that wrote the Parquet file.
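
On the reader side, the simple name-plus-version approach described in the
issue could be used roughly as follows. This is only a sketch under stated
assumptions: the KNOWN_ISSUES table, the application name "example-app", the
workaround labels and the dotted numeric version format are all made up for
illustration and do not come from any Parquet implementation.

```python
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Parse a dotted numeric version such as "2.0.5" (assumed format)."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical table: application name -> list of
# (first version in which the bug is fixed, workaround to apply before that).
KNOWN_ISSUES: Dict[str, List[Tuple[Version, str]]] = {
    "example-app": [
        (parse_version("2.1.0"), "ignore_statistics"),
        (parse_version("1.4.0"), "skip_timestamp_adjustment"),
    ],
}


def workarounds_for(application: Optional[str], version: Optional[str]) -> List[str]:
    """Return the workarounds a reader should enable for this writer.

    Files without application metadata get no workarounds, because there
    is nothing to identify the writing application by.
    """
    if application is None or version is None:
        return []
    written_by = parse_version(version)
    return [
        workaround
        for fixed_in, workaround in KNOWN_ISSUES.get(application, [])
        if written_by < fixed_in
    ]


print(workarounds_for("example-app", "2.0.5"))
```

Comparing against a "fixed in" version keeps the table small, but it assumes
that versions are totally ordered; the more sophisticated approach of listing
fixed issues in the file itself would avoid that assumption.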