I haven't really followed Variant development, but it seems entirely reasonable for implementations to enforce a nesting limit (say, 64 levels).
I would point out that we already have somewhat similar limits in Parquet C++ for Thrift decoding:
https://github.com/apache/arrow/blob/c1036681b099c5f9b0684a710be04bb7619e926f/cpp/src/parquet/properties.h#L105-L121

I'll add that parsing Variants is a natural target for fuzz testing.

Regards

Antoine.

On 14/05/2026 at 15:28, Steve Loughran wrote:
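The nesting-limit idea above can be sketched in a few lines. This is a hypothetical illustration, not code from parquet-java or Parquet C++; the names `MAX_NESTING_DEPTH`, `checkDepth`, and `maxDepth` are invented for the example, and a real decoder would thread the depth counter through its object/array parsing the same way:

```java
// Hypothetical sketch: capping nesting depth while walking a variant-like
// structure, so a maliciously deep input fails fast instead of overflowing
// the stack. All names here are illustrative, not a real parquet API.
public class DepthGuard {
    // The 64-level cap is the example figure from the discussion above.
    static final int MAX_NESTING_DEPTH = 64;

    static void checkDepth(int depth) {
        if (depth > MAX_NESTING_DEPTH) {
            throw new IllegalStateException(
                "Variant nesting exceeds limit of " + MAX_NESTING_DEPTH);
        }
    }

    // Recursive descent over a toy nested structure (Object[] = container,
    // anything else = leaf); returns the maximum depth reached.
    static int maxDepth(Object node, int depth) {
        checkDepth(depth);
        if (node instanceof Object[]) {
            Object[] children = (Object[]) node;
            int max = depth;
            for (Object child : children) {
                max = Math.max(max, maxDepth(child, depth + 1));
            }
            return max;
        }
        return depth;
    }
}
```

A fuzzer feeding deeply nested inputs into such a decoder would then hit the `IllegalStateException` rather than a `StackOverflowError`.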
My PR hardening the variant readers, https://github.com/apache/parquet-java/pull/3562, isn't quite a security issue, though "malformed 1k file can trigger 1+ GB array/dictionary allocations" is still unwelcome. Nothing blows up simply reading the file, only when creating a Variant() from the data. Arrow's variant code is very rigorous here; I haven't looked at the C++ code.

I do think this is needed, but Neelesh's performance PR, https://github.com/apache/parquet-java/pull/3481, should go in first as it is critical performance-wise. If we get reviews and a merge of #3481, I'll rework mine around that and then people can review mine.

One open issue: should the depth of variant nesting be restricted? My PR does this; nothing else does, though Spark puts a 16 MB limit on variant data size. If there are to be limits, then VariantEncoding.md should define them explicitly.

I've got the invalid files to add to parquet-testing/bad_data now; if anyone wants them to play with, ask, or get my PR to create them for you.

Steve
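The "malformed 1k file triggers 1+ GB allocations" failure mode comes from trusting a length field read out of the file. A minimal sketch of the defensive pattern, assuming a length-prefixed byte region (the class and method names are invented for illustration, not the parquet-java API):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: validate an untrusted declared size against the
// bytes actually present before allocating, so a tiny malformed file
// cannot request a 1+ GB array.
public class SafeAlloc {
    static byte[] readSizedBytes(ByteBuffer buf) {
        int declared = buf.getInt(); // untrusted length read from the file
        if (declared < 0 || declared > buf.remaining()) {
            throw new IllegalArgumentException(
                "declared size " + declared + " exceeds remaining "
                + buf.remaining() + " bytes");
        }
        byte[] out = new byte[declared]; // now bounded by real input size
        buf.get(out);
        return out;
    }
}
```

The key design point is that the allocation can never exceed the number of bytes actually backing the buffer, whatever the file claims.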
