wombatu-kun opened a new pull request, #16508:
URL: https://github.com/apache/iceberg/pull/16508

   Closes #16458
   
   ## Summary
   
   Five vectorized Parquet decoders in `iceberg-arrow` allocated heap arrays 
whose size came directly from the file bytes — varint headers in 
DELTA_BINARY_PACKED, the available-byte count in BYTE_STREAM_SPLIT, 
packed-group counts in the RLE/PACKED reader, and prefix-length entries in 
DELTA_BYTE_ARRAY. A crafted page could force a multi-hundred-MB allocation 
before any row-batch or page-size limit applied. Maintainers downgraded the 
report from security to a normal hardening task and invited a fix.
   
   This PR adds tight bound checks before each allocation, so a malformed page 
now fails with a descriptive `IllegalArgumentException` instead of allocating 
gigabytes of heap. Bounds come from each encoding's own math, not from 
arbitrary tuning knobs, so spec-conformant pages are unaffected.
   
   ## What changed
   
   `BaseVectorizedParquetValuesReader`: PACKED group count fits in `int` 
(avoids `numGroups * 8` overflow) and the run cannot claim more bytes than the 
page still has.
   
   `VectorizedDeltaEncodedValuesReader`: `blockSizeInValues` is in `(0, 1 << 
20]` — the upper bound is ~8000× the parquet-mr default of 128, well clear of 
any legitimate writer. `miniBlocksPerBlock` is in `(0, blockSize]`. 
`totalValueCount` is in `[0, valueCount]` (it is the file-declared non-null 
count, which can never exceed the caller's total).
   
   `VectorizedByteStreamSplitValuesReader`: page size equals `valueCount * 
elementSizeInBytes` exactly (the encoding is byte-exact). The 
previously-ignored `valueCount` parameter is now used for validation. 
`readBinary(int len)` enforces `0 <= len <= decodedDataStream.remaining()` 
before allocating.
   
   `VectorizedDeltaByteArrayValuesReader`: prefix length fits in the previous 
value (spec invariant), total length is at least the prefix length, and the 
combined `prefix + suffix` length does not overflow `int`.
   
   `VectorizedDeltaLengthByteArrayValuesReader`: `readBinary(int len)` enforces 
`0 <= len <= dataStream.available()` before allocating.
   
   ## Tests
   
   One new test class per modified reader (`Test*ValuesReader`), plus an 
abstract `VectorizedParquetDecoderTestBase` that owns the shared forged-page 
helpers (`page`, `concat`). Each test class covers every new guard via a 
hand-built malformed page, plus a positive round-trip to confirm 
spec-conformant pages still decode correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to