my PR hardening variant readers, https://github.com/apache/parquet-java/pull/3562, isn't quite a security issue, though "malformed 1 kB file can trigger 1+ GB array/dictionary allocations" is still unwelcome. Nothing blows up simply reading the file; the problem only surfaces if you try to create a Variant() from the data.
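The kind of check involved can be sketched like this (a hypothetical illustration, not the PR's actual code; the method name and parameters are made up): before sizing an array from an untrusted length field, bound the declared count by what the remaining bytes of the buffer could possibly hold, so a 1 kB file can never demand a gigabyte-scale allocation.

```java
public class VariantSizeCheck {

    /**
     * Validate an element count parsed from untrusted file metadata.
     * (Illustrative sketch only; not the API used in PR #3562.)
     *
     * @param declaredCount    count read from the file
     * @param remainingBytes   bytes left in the buffer
     * @param minBytesPerEntry smallest possible serialized size of one entry
     * @return declaredCount if it is plausible for this buffer
     * @throws IllegalArgumentException if the count cannot possibly fit
     */
    static int checkedCount(int declaredCount, int remainingBytes, int minBytesPerEntry) {
        if (declaredCount < 0) {
            throw new IllegalArgumentException("negative count: " + declaredCount);
        }
        // Use long arithmetic so the multiply cannot overflow int.
        long minNeeded = (long) declaredCount * minBytesPerEntry;
        if (minNeeded > remainingBytes) {
            throw new IllegalArgumentException("declared count " + declaredCount
                + " needs at least " + minNeeded + " bytes but only "
                + remainingBytes + " remain");
        }
        return declaredCount;
    }

    public static void main(String[] args) {
        // Plausible: 100 entries of at least 4 bytes each in a 1 kB buffer.
        System.out.println(checkedCount(100, 1024, 4));
        // Implausible: a multi-hundred-million-entry dictionary declared
        // inside a 1 kB file is rejected before any allocation happens.
        try {
            checkedCount(300_000_000, 1024, 4);
            System.out.println("no exception");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```

The point is simply that the allocation is gated on the physical size of the input, which is why reading the raw bytes stays safe and only Variant construction needed hardening.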
Arrow's variant code is very rigorous here; I haven't looked at the C++ code. I do think this hardening is needed, but Neelesh's performance PR should go in first, as it is critical performance-wise: https://github.com/apache/parquet-java/pull/3481. Once #3481 is reviewed and merged, I'll rework mine around it, and then people can review that.

One open issue: should the depth of variant nesting be restricted? My PR restricts it; nothing else does, though Spark puts a 16 MB limit on variant data size. If there are to be limits, then VariantEncoding.md should define them explicitly.

I've got the invalid files to add to parquet-testing/bad_data now; if anyone wants them to play with, just ask, or use my PR to create them yourself.

Steve
