wombatu-kun opened a new pull request, #16500: URL: https://github.com/apache/iceberg/pull/16500
Closes #16475 Hostile or corrupted manifest metadata can claim a negative or near-2 GiB `content_offset` / `content_size_in_bytes` on a deletion-vector `DeleteFile`. The existing precondition in `BaseDeleteLoader.validateDV` only checks non-null and `size <= Integer.MAX_VALUE`, so a negative size slips through to `new byte[length]` and throws `NegativeArraySizeException`; a negative offset reaches `IOUtil.readFully` and produces an invalid seek. The same unguarded allocation pattern lives in `DVUtil.readDV` on the rewrite / DV-merge path used by `mergeAndWriteDVsIfRequired`. ### Changes - Add `ContentFileUtil.validateDV` as the canonical read-path validator (null offset, null size, `offset >= 0`, `size >= 0`, `size <= Integer.MAX_VALUE`). - Have both `BaseDeleteLoader.validateDV` (data) and `DVUtil.readDV` (core) delegate to it. `BaseDeleteLoader` keeps its loader-specific `referencedDataFile` check. - For defense-in-depth, also reject negative `contentOffset` / `contentSizeInBytes` in `FileMetadata.Builder.build()` for `PUFFIN`, mirroring the existing `fileSizeInBytes >= 0` / `recordCount >= 0` checks at the same site. Producers using the builder can no longer accidentally emit bad metadata; the read-side check still defends against corrupted on-disk manifests where `BaseFile.internalSet` hydrates fields directly via Avro reflection. ### Tests - `TestContentFileUtil` (new) — covers each precondition of the new validator (null offset/size, negative offset/size, oversized size, valid zero / typical / `Integer.MAX_VALUE` values). - `TestFileMetadata` (new) — verifies the builder rejects negative DV offset/size and still accepts valid values. - `TestBaseDeleteLoader` (new, in package `org.apache.iceberg` so it can reach the package-private `Delegates.DelegatingDeleteFile`) — constructs a tampered `DeleteFile` that bypasses the builder and asserts `BaseDeleteLoader.loadPositionDeletes` throws before any I/O is attempted. - Existing `TestRowDelta` (324 tests), `TestDeletionVectorStruct`, `TestContentFileParser`, `TestDeleteFiles` regression suites continue to pass. ### Out of scope `PuffinReader` / `BlobMetadata` similarly don't range-check blob offsets and lengths today, but Puffin is also used for non-DV blobs (statistics) and "valid" at that layer is a separate discussion of footer-authoritative vs. caller-supplied bounds. Leaving Puffin-level hardening as a follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
