aihuaxu commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1854863959
##########
VariantEncoding.md:
##########
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.

 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named `value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.

Review Comment:
   In Spark's implementation of Variant, `value` and `metadata` are internally limited to 16MB, so there is a limit on how much data can be stored. Different engines may have different internal limits, so some Variant values may not be loadable across engines. I'm wondering if we should state such a limit explicitly in the spec. Of course, it is hard to define such a limit precisely, since input data of the same size may be encoded differently.
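For concreteness, the group described in the new spec text corresponds to a Parquet schema along these lines (a minimal sketch in the schema notation the spec uses elsewhere; `var_col` is a placeholder column name, and the outer group's repetition depends on whether the column itself is nullable):

    optional group var_col (VARIANT) {
      required binary metadata;
      required binary value;
    }

Marking both inner fields `required` reflects the rule quoted above that neither `value` nor `metadata` can be `null`.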
