Re: [PR] API: Harden variant binary parsing against malformed input [iceberg]

via GitHub Tue, 16 Jun 2026 06:49:37 -0700


laskoviymishka commented on code in PR #16568:
URL: https://github.com/apache/iceberg/pull/16568#discussion_r3421232920



##########
api/src/main/java/org/apache/iceberg/variants/VariantUtil.java:
##########
@@ -30,8 +32,38 @@ class VariantUtil {
   private static final int BASIC_TYPE_OBJECT = 2;
   private static final int BASIC_TYPE_ARRAY = 3;
 
+  /**
+   * Maximum permitted nesting depth of a Variant value. The top-level value 
is depth 0, so a
+   * Variant may contain up to {@code MAX_VARIANT_DEPTH} nested levels.
+   */
+  static final int MAX_VARIANT_DEPTH = 500;
+
   private VariantUtil() {}
 
+  /** Parses a variant value from {@code value} using {@code metadata} for 
field-name resolution. */
+  static VariantValue fromBuffer(VariantMetadata metadata, ByteBuffer value, 
int depth) {
+    Preconditions.checkArgument(
+        depth <= MAX_VARIANT_DEPTH,
+        "Invalid variant: nesting depth %s exceeds maximum %s",

Review Comment:
   The Javadoc on `MAX_VARIANT_DEPTH` says a variant "may contain up to 
MAX_VARIANT_DEPTH nested levels," but `depth <= MAX_VARIANT_DEPTH` with the 
top-level at depth 0 actually admits 501 levels (depths 0 through 500). It's 
internally consistent, just off-by-one against the comment.
   
   I'd either flip this to `depth < MAX_VARIANT_DEPTH` or update the Javadoc to 
state the counting model exactly. A `depth >= 0` lower bound here would also 
close the direct-call path where `depth + 1` could overflow past the cap.



##########
api/src/main/java/org/apache/iceberg/variants/SerializedArray.java:
##########
@@ -35,29 +35,46 @@ static SerializedArray from(VariantMetadata metadata, 
byte[] bytes) {
     return from(metadata, 
ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN), bytes[0]);
   }
 
+  @VisibleForTesting
   static SerializedArray from(VariantMetadata metadata, ByteBuffer value, int 
header) {
+    return from(metadata, value, header, 0);
+  }
+
+  static SerializedArray from(VariantMetadata metadata, ByteBuffer value, int 
header, int depth) {
     Preconditions.checkArgument(
         value.order() == ByteOrder.LITTLE_ENDIAN, "Unsupported byte order: big 
endian");
     BasicType basicType = VariantUtil.basicType(header);
     Preconditions.checkArgument(
         basicType == BasicType.ARRAY, "Invalid array, basic type: " + 
basicType);
-    return new SerializedArray(metadata, value, header);
+    return new SerializedArray(metadata, value, header, depth);
   }
 
   private final VariantMetadata metadata;
   private final ByteBuffer value;
   private final int offsetSize;
   private final int offsetListOffset;
   private final int dataOffset;
+  private final int depth;
   private final VariantValue[] array;
 
-  private SerializedArray(VariantMetadata metadata, ByteBuffer value, int 
header) {
+  private SerializedArray(VariantMetadata metadata, ByteBuffer value, int 
header, int depth) {
     this.metadata = metadata;
     this.value = value;
+    this.depth = depth;
     this.offsetSize = 1 + ((header & OFFSET_SIZE_MASK) >> OFFSET_SIZE_SHIFT);
     int numElementsSize = ((header & IS_LARGE) == IS_LARGE) ? 4 : 1;
+    Preconditions.checkArgument(
+        value.remaining() >= HEADER_SIZE + numElementsSize,
+        "Invalid variant array: buffer too small for element count field");
     int numElements = ByteBuffers.readLittleEndianUnsigned(value, HEADER_SIZE, 
numElementsSize);
+    Preconditions.checkArgument(
+        numElements >= 0, "Invalid variant array: negative element count %s", 
numElements);
     this.offsetListOffset = HEADER_SIZE + numElementsSize;
+    long offsetTableEnd = (long) offsetListOffset + ((long) numElements + 1L) 
* offsetSize;
+    Preconditions.checkArgument(
+        offsetTableEnd <= value.remaining(),
+        "Invalid variant array: element count %s exceeds buffer",
+        numElements);
     this.dataOffset = offsetListOffset + ((1 + numElements) * offsetSize);

Review Comment:
   `dataOffset` here is assigned with plain-`int` arithmetic, but we just 
computed the same bound in `long` as `offsetTableEnd` and guarded against it.
   
   For `offsetSize = 4`, `(1 + numElements) * offsetSize` overflows `int` 
around numElements ~536M — a valid 4-byte count. The `long` guard above still 
passes, so `dataOffset` wraps negative and we hit a `ByteBuffer` underflow at 
access time instead of the `IllegalArgumentException` this check is meant to 
produce.
   
   I'd just reuse the value we already have: `this.dataOffset = 
Math.toIntExact(offsetTableEnd)`. wdyt?



##########
api/src/main/java/org/apache/iceberg/variants/VariantUtil.java:
##########
@@ -30,8 +32,38 @@ class VariantUtil {
   private static final int BASIC_TYPE_OBJECT = 2;
   private static final int BASIC_TYPE_ARRAY = 3;
 
+  /**
+   * Maximum permitted nesting depth of a Variant value. The top-level value 
is depth 0, so a
+   * Variant may contain up to {@code MAX_VARIANT_DEPTH} nested levels.
+   */
+  static final int MAX_VARIANT_DEPTH = 500;
+

Review Comment:
   This is a non-spec limit — the Parquet VariantEncoding spec doesn't cap 
nesting depth and Iceberg defers to it. A valid variant with >500 levels 
written by Spark or another client would read fine elsewhere but throw here at 
scan time, which reads as silent data loss rather than malformed-input 
rejection.
   
   What did parquet-java#3562 settle on? If it picked a different value (or no 
cap), we'd diverge on file interop. I'd at minimum cross-reference that 
decision in a comment here and confirm 500 is high enough to never reject real 
data (it's below Jackson's default of 2000).



##########
api/src/main/java/org/apache/iceberg/variants/SerializedObject.java:
##########
@@ -60,18 +64,33 @@ static SerializedObject from(VariantMetadata metadata, 
ByteBuffer value, int hea
   private final int[] offsets;
   private final int[] lengths;
   private final int dataOffset;
+  private final int depth;
   private final VariantValue[] values;
 
-  private SerializedObject(VariantMetadata metadata, ByteBuffer value, int 
header) {
+  private SerializedObject(VariantMetadata metadata, ByteBuffer value, int 
header, int depth) {
     this.metadata = metadata;
     this.value = value;
+    this.depth = depth;
     this.offsetSize = 1 + ((header & OFFSET_SIZE_MASK) >> OFFSET_SIZE_SHIFT);
     this.fieldIdSize = 1 + ((header & FIELD_ID_SIZE_MASK) >> 
FIELD_ID_SIZE_SHIFT);
     int numElementsSize = ((header & IS_LARGE) == IS_LARGE) ? 4 : 1;
+    Preconditions.checkArgument(
+        value.remaining() >= HEADER_SIZE + numElementsSize,
+        "Invalid variant object: buffer too small for element count field");
     int numElements = ByteBuffers.readLittleEndianUnsigned(value, HEADER_SIZE, 
numElementsSize);
+    Preconditions.checkArgument(
+        numElements >= 0, "Invalid variant object: negative element count %s", 
numElements);
     this.fieldIdListOffset = HEADER_SIZE + numElementsSize;
-    this.fieldIds = new Integer[numElements];
+    long dataStart =
+        (long) fieldIdListOffset
+            + (long) numElements * fieldIdSize
+            + ((long) numElements + 1L) * offsetSize;
+    Preconditions.checkArgument(
+        dataStart <= value.remaining(),

Review Comment:
   Same int-overflow as the array side: `offsetListOffset` and `dataOffset` 
below are assigned in plain `int` while `dataStart` here is `long`. 
`numElements * fieldIdSize` and `(1 + numElements) * offsetSize` can overflow 
before the `long` guard's bound is ever applied to the int fields, wrapping the 
data offset negative for a large-but-valid count.
   
   I'd derive both offsets from the `long` intermediates and `Math.toIntExact` 
them so the guard actually protects the value we use.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] API: Harden variant binary parsing against malformed input [iceberg]

Reply via email to