Hello,

I've been reading the spec in more detail here:
https://github.com/apache/spark/blob/d84f1a3575c4125009374521d2f179089ebd71ad/common/variant/README.md#encoding-types

and I think that it should have a Security section listing potential
security issues with this format (especially for readers).

Given that Parquet is frequently used to make data publicly available
online, it is important for implementers to know of potential issues to
look for, and ideally protect against.


One specific concern is the following snippet about the Object encoding:

"The field ids and field offsets must be in lexicographical order of the
corresponding field names in the metadata dictionary. However, the
actual value entries do not need to be in any particular order. This
implies that the field_offset values may not be monotonically
increasing."

Having field offsets which are not monotonically increasing makes it
difficult to verify that the encoded values do not overlap. In general,
it's useful for data formats to enable easy validation and error reporting.
In this particular case, an attacker could perhaps craft a malicious
Variant with deeply nested overlapping values to achieve a denial of
service attack, similar to
https://en.wikipedia.org/wiki/Billion_laughs_attack

(I'm not saying such a malicious Variant is practically doable given
specifics of the binary encoding, but it will be difficult to prove
that it isn't)
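
To illustrate the validation cost, here is a minimal Python sketch of the
overlap check a defensive reader might need. It is not taken from the spec
or from any existing implementation: the function name `find_overlap` and
its parameters are hypothetical, and it assumes the byte size of each field
value has already been determined by decoding that value's header. Because
the offsets are not guaranteed to be monotonically increasing, the spans
have to be sorted first, whereas a single linear pass would suffice if they
were ordered.

```python
def find_overlap(offsets, sizes, data_len):
    """Return a pair of field indices whose value spans overlap, or None.

    offsets  -- start offset of each field value (any order)
    sizes    -- byte size of each field value (assumed already decoded)
    data_len -- size in bytes of the object's value area
    """
    if len(offsets) != len(sizes):
        raise ValueError("offsets and sizes must have the same length")

    spans = []
    for index, (start, size) in enumerate(zip(offsets, sizes)):
        # Every encoded value occupies at least one byte and must fit
        # inside the value area.
        if start < 0 or size <= 0 or start + size > data_len:
            raise ValueError(f"field {index} is out of bounds")
        spans.append((start, start + size, index))

    # Offsets are not required to be increasing, so sort by start offset.
    spans.sort()

    # After sorting, any overlap shows up between two adjacent spans.
    for (_, prev_end, prev_index), (start, _, index) in zip(spans, spans[1:]):
        if start < prev_end:
            return (prev_index, index)
    return None
```

A reader would also have to repeat this check for every nested object and
bound the recursion depth, which is the kind of guidance a Security section
could spell out.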

Regards

Antoine.

