My PR hardening the variant readers,
https://github.com/apache/parquet-java/pull/3562, isn't quite a security
issue, though "a malformed 1 KB file can trigger 1+ GB array/dictionary
allocations" is still unwelcome. Nothing blows up simply from reading the
file, only when a Variant() is created from the data.
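To make the failure mode concrete, here is a minimal sketch of the kind of defensive check such hardening adds. This is illustrative only, not the actual parquet-java code from the PR: the class and method names are hypothetical. The idea is that any element count read from file metadata must be sanity-checked against the bytes actually remaining before an array is allocated from it.

```java
// Hypothetical sketch, not the actual parquet-java implementation.
public class VariantSizeCheck {
    /**
     * Returns the declared element count if it is plausible for the input,
     * otherwise throws. A malformed file can declare a huge count inside a
     * tiny buffer; rejecting it here avoids a 1+ GB allocation from a
     * 1 KB input.
     */
    static int checkedCount(int declaredCount, int bytesRemaining,
                            int minBytesPerElement) {
        if (declaredCount < 0) {
            throw new IllegalArgumentException(
                "negative count: " + declaredCount);
        }
        // Each element needs at least minBytesPerElement bytes of input,
        // so a valid count can never exceed bytesRemaining / minBytesPerElement.
        if ((long) declaredCount * minBytesPerElement > bytesRemaining) {
            throw new IllegalArgumentException(
                "declared count " + declaredCount + " cannot fit in "
                + bytesRemaining + " remaining bytes");
        }
        return declaredCount;
    }
}
```

The cast to long matters: a hostile count multiplied by an element size can overflow int and slip past a naive comparison.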

Arrow's variant code is very rigorous here; I haven't looked at the C++
code.

I do think this hardening is needed, but Neelesh's performance PR should
go in first, as it is performance-critical.

https://github.com/apache/parquet-java/pull/3481

Once #3481 is reviewed and merged, I'll rework mine around it, and then
people can review mine. One open issue: should the depth of variant
nesting be restricted? My PR restricts it; nothing else does, though Spark
puts a 16 MB limit on variant data size. If there are to be limits, then
VariantEncoding.md should define them explicitly.
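For what a depth restriction could look like, here is a minimal sketch of a guard a recursive variant parser could carry. The class name and the limit of 64 are both my own illustrative choices; VariantEncoding.md does not currently define any such value.

```java
// Hypothetical sketch: a depth guard for recursive variant parsing.
// The limit of 64 is illustrative, not anything the spec defines.
public class DepthGuard {
    static final int MAX_DEPTH = 64;
    private int depth = 0;

    /** Call on entering a nested object/array; throws if nesting is too deep. */
    void enter() {
        if (++depth > MAX_DEPTH) {
            throw new IllegalStateException(
                "variant nesting exceeds limit of " + MAX_DEPTH);
        }
    }

    /** Call on leaving a nested object/array. */
    void exit() {
        depth--;
    }
}
```

A guard like this turns a malicious deeply-nested value into a clean exception instead of a StackOverflowError partway through parsing.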

I've got the invalid files ready to add to parquet-testing/bad_data now;
if anyone wants them to play with, ask, or run my PR to create them for
you.

Steve
