One problem with Parquet user-defined key/value metadata shows up when the footers of multiple Parquet files are merged to generate the summary files. If two Parquet files have key/value entries with the same key but different values, Parquet doesn't know how to merge them; it simply throws an exception and gives up writing the summary file. If you're appending new data to an existing directory that has old summary files, you may end up with stale summary files, since the old ones are not properly overwritten.

This can be a problem in the case of schema evolution. For example, Spark SQL writes JSON-ized schema strings to Parquet files as key/value metadata. When new Parquet files are appended to a directory whose existing files have different but compatible schemata, the summary files can't be properly generated.
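
To make the failure mode concrete, here is a minimal Python sketch of a merge that gives up on conflicting values. It only illustrates the behavior described above and is not the actual Parquet code; the function name `merge_key_value_metadata` and the key `schema_json` are made up for this example.

```python
def merge_key_value_metadata(footers):
    """Merge per-file key/value metadata dicts into one dict.

    Hypothetical helper: raises as soon as two files disagree on the
    value for the same key, mirroring the "throw and give up" behavior.
    """
    merged = {}
    for footer in footers:
        for key, value in footer.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"conflicting values for key {key!r}")
            merged[key] = value
    return merged


# Two files written before and after a compatible schema change.
# "schema_json" is an illustrative key, not the one Spark SQL uses.
old_file = {"schema_json": '{"fields": ["a"]}'}
new_file = {"schema_json": '{"fields": ["a", "b"]}'}

try:
    merge_key_value_metadata([old_file, new_file])
except ValueError as err:
    # No summary is produced, so any older summary file stays in place.
    print("summary not written:", err)
```

Identical values merge fine, so the conflict only appears once the schema strings start to differ between files.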

But in practice this isn't a big problem since Parquet summary files are not that important nowadays.


Cheng


On 6/16/16 1:19 PM, Wes McKinney wrote:
To add a bit of context, we're looking at the handling of integers
other than INT32 and INT64 from the perspective of Apache Arrow. It
seems that in Parquet 1 files, you may not be able to recover the
original integer types from the file alone. The question is, should we
put this metadata in the Parquet file? See
https://github.com/apache/arrow/pull/89/files#diff-147a93dad8a2dfdac5531007c5c686b1R67

If it may cause problems, we can leave the physical storage type as is
and leave users to explicitly cast on deserialization to another
integer type.
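
As a rough illustration of that second option, the Python sketch below stores small integers in 4-byte form and leaves the narrowing to the reader. It uses only the standard `struct` module and does not touch any Parquet or Arrow API; the helper names `write_as_int32`, `read_int32`, and `cast_to_int8` are invented for this example.

```python
import struct


def write_as_int32(values):
    """Store every value as a 4-byte little-endian signed integer."""
    return struct.pack(f"<{len(values)}i", *values)


def read_int32(buf):
    """Read back 4-byte integers; the original width is not recorded."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}i", buf))


def cast_to_int8(values):
    """Explicit user-side narrowing; fails if a value does not fit."""
    for v in values:
        if not -128 <= v <= 127:
            raise OverflowError(f"{v} does not fit in a signed 8-bit integer")
    return struct.pack(f"<{len(values)}b", *values)


stored = write_as_int32([1, -2, 127])   # 12 bytes on disk
values = read_int32(stored)             # plain 32-bit values, no type hint
narrowed = cast_to_int8(values)         # 3 bytes, chosen by the user
```

The trade-off is that the file itself never says the data started out as 8-bit, so the reader has to know which cast to apply.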

Thanks,
Wes

On Thu, Jun 16, 2016 at 12:57 PM, Uwe Korn <[email protected]> wrote:
Hello,

I'm currently looking at the differences between Parquet 1 and Parquet 2 to
implement these versions as a switch in parquet-cpp. The only list I could
find is the rather undetailed changelog [1]. Is there maybe some better list
or do I need to go through the referenced changeset entries myself to find
the actual differences? (If the latter is the case, I'd also make a PR
afterwards that augments the documentation with some "(since version 2.0)"
markings.) But I'm hoping a bit that there is some blog post or so out there
that could make my life easier.

Thanks,

Uwe

[1] https://github.com/apache/parquet-format/blob/master/CHANGES.md

