Hello,
I've been reading the spec in more detail here:
https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md#encoding-types
and I think that it should have a Security section listing potential security issues with this format (especially for readers). Given that Parquet is frequently used to make data publicly available online, it is important for implementers to know which potential issues to look for and, ideally, protect against.

One specific concern is the following snippet about the Object encoding:

"The field ids and field offsets must be in lexicographical order of the corresponding field names in the metadata dictionary. However, the actual value entries do not need to be in any particular order. This implies that the field_offset values may not be monotonically increasing."

Having field offsets which are not monotonically increasing makes it difficult to verify that the encoded values do not overlap. In general, it's useful for data formats to enable easy validation and error reporting.

In this particular case, an attacker could perhaps craft a malicious Variant with deeply nested overlapping values to mount a denial-of-service attack, similar to https://en.wikipedia.org/wiki/Billion_laughs_attack (I'm not saying such a malicious Variant is practically doable given the specifics of the binary encoding, but it will be difficult to prove that it isn't).

Regards

Antoine.
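
P.S. To illustrate the validation cost mentioned above, here is a minimal sketch in Python of an overlap check for one object. It is not taken from any existing Variant reader: the function name `values_are_disjoint`, the `(start, length)` input shape, and the `total_size` parameter are all assumptions made for illustration. It also assumes the reader has already worked out the byte length of each value, which, as far as I understand, requires decoding each value's header first.

```python
from typing import Iterable, Tuple


def values_are_disjoint(ranges: Iterable[Tuple[int, int]], total_size: int) -> bool:
    """Check that (start, length) byte ranges stay in bounds and never overlap.

    Hypothetical helper: `ranges` holds one (start, length) pair per field
    value, relative to the start of the object's value area, and
    `total_size` is the size of that area in bytes.
    """
    prev_end = 0
    # Offsets may arrive in any order, so they must be sorted first.
    # With monotonically increasing offsets this sort would be unnecessary.
    for start, length in sorted(ranges):
        if start < 0 or length < 1 or start + length > total_size:
            return False  # range falls outside the value area
        if start < prev_end:
            return False  # range begins inside the previous value
        prev_end = start + length
    return True
```

For example, `values_are_disjoint([(4, 2), (0, 4), (6, 3)], 9)` returns `True` even though the offsets are not increasing, while `values_are_disjoint([(0, 4), (2, 4)], 8)` returns `False`. The point is that the check needs a sort (or an equivalent interval structure) per object, repeated at every nesting level, whereas ordered offsets would allow a single linear pass.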