I haven't really followed Variant development, but it seems entirely reasonable for implementations to enforce sensible nesting limits (say, 64 levels).
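As a minimal sketch of what such a limit could look like, the following tracks recursion depth while walking a nested value and rejects anything past the limit. The `NestedValue` type, `checkDepth` method, and the 64-level cap are illustrative assumptions, not the actual Parquet Variant API.

```java
import java.util.List;

public class DepthLimitDemo {
    // Illustrative limit, matching the 64 levels mentioned above.
    static final int MAX_DEPTH = 64;

    // Hypothetical stand-in for a nested object/array value.
    record NestedValue(List<NestedValue> children) {}

    // Walk the tree, failing as soon as the depth limit is exceeded.
    static void checkDepth(NestedValue v, int depth) {
        if (depth > MAX_DEPTH) {
            throw new IllegalStateException(
                "nesting depth exceeds limit of " + MAX_DEPTH);
        }
        for (NestedValue child : v.children()) {
            checkDepth(child, depth + 1);
        }
    }

    public static void main(String[] args) {
        // Build a chain 100 levels deep; the check should reject it.
        NestedValue v = new NestedValue(List.of());
        for (int i = 0; i < 100; i++) {
            v = new NestedValue(List.of(v));
        }
        try {
            checkDepth(v, 1);
            System.out.println("accepted");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

An explicit depth counter like this also bounds the parser's own stack usage, which matters for untrusted input.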

I would point out that we already have somewhat similar limits in Parquet C++ for Thrift decoding:
https://github.com/apache/arrow/blob/c1036681b099c5f9b0684a710be04bb7619e926f/cpp/src/parquet/properties.h#L105-L121

I'll add that parsing Variants is a natural target for fuzz testing.

Regards

Antoine.


On 14/05/2026 at 15:28, Steve Loughran wrote:
My PR hardening variant readers,
https://github.com/apache/parquet-java/pull/3562, isn't quite a security
issue, though "a malformed 1 KB file can trigger 1+ GB array/dictionary
allocations" is still unwelcome. Nothing will blow up simply reading the
file; it only happens when a Variant() is created from the data.
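A sketch of the kind of guard that prevents this class of problem: validate a size field read from the file against the bytes actually available before allocating. The method name, element width, and buffer layout here are illustrative assumptions, not the actual code in the PR.

```java
import java.nio.ByteBuffer;

public class AllocationGuard {
    // Read a 4-byte element count from the buffer, but reject counts that
    // could not possibly fit in the remaining bytes before any allocation.
    static int readCheckedSize(ByteBuffer buf, int elementWidth) {
        int declared = buf.getInt();  // size claimed by the (possibly hostile) file
        if (declared < 0 || (long) declared * elementWidth > buf.remaining()) {
            throw new IllegalArgumentException(
                "declared size " + declared + " exceeds remaining "
                + buf.remaining() + " bytes");
        }
        return declared;
    }

    public static void main(String[] args) {
        // A 1 KB buffer claiming roughly a billion 4-byte entries: the guard
        // rejects it instead of attempting a multi-gigabyte allocation.
        ByteBuffer malformed = ByteBuffer.allocate(1024);
        malformed.putInt(0, 1_000_000_000);
        try {
            readCheckedSize(malformed, 4);
            System.out.println("accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The key point is the widening to `long` in the comparison, so a large declared count cannot overflow the multiplication and slip past the check.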

Arrow's variant code is very rigorous here; I haven't looked at the C++
code.

I do think this is needed, but Neelesh's performance PR should go in first,
as it is performance-critical.

https://github.com/apache/parquet-java/pull/3481

Once #3481 is reviewed and merged, I'll rework mine around it, and then
people can review it. One open issue: should the depth of variant nesting
be restricted? My PR restricts it; nothing else does, though Spark puts a
16 MB limit on variant data size. If there are to be limits, then
VariantEncoding.md should define them explicitly.

I've got the invalid files to add to parquet-testing/bad_data now; if
anyone wants to play with them, ask, or use my PR to create them for you.

Steve


