laskoviymishka commented on code in PR #16568:
URL: https://github.com/apache/iceberg/pull/16568#discussion_r3421232920
##########
api/src/main/java/org/apache/iceberg/variants/VariantUtil.java:
##########
@@ -30,8 +32,38 @@ class VariantUtil {
private static final int BASIC_TYPE_OBJECT = 2;
private static final int BASIC_TYPE_ARRAY = 3;
+ /**
+ * Maximum permitted nesting depth of a Variant value. The top-level value
is depth 0, so a
+ * Variant may contain up to {@code MAX_VARIANT_DEPTH} nested levels.
+ */
+ static final int MAX_VARIANT_DEPTH = 500;
+
private VariantUtil() {}
+ /** Parses a variant value from {@code value} using {@code metadata} for
field-name resolution. */
+ static VariantValue fromBuffer(VariantMetadata metadata, ByteBuffer value,
int depth) {
+ Preconditions.checkArgument(
+ depth <= MAX_VARIANT_DEPTH,
+ "Invalid variant: nesting depth %s exceeds maximum %s",
Review Comment:
The Javadoc on `MAX_VARIANT_DEPTH` says a variant "may contain up to
MAX_VARIANT_DEPTH nested levels," but `depth <= MAX_VARIANT_DEPTH` with the
top-level at depth 0 actually admits 501 levels (depths 0 through 500). It's
internally consistent, just off-by-one against the comment.
I'd either flip this to `depth < MAX_VARIANT_DEPTH` or update the Javadoc to
state the counting model exactly. A `depth >= 0` lower bound here would also
close the direct-call path where `depth + 1` could overflow past the cap.
##########
api/src/main/java/org/apache/iceberg/variants/SerializedArray.java:
##########
@@ -35,29 +35,46 @@ static SerializedArray from(VariantMetadata metadata,
byte[] bytes) {
return from(metadata,
ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN), bytes[0]);
}
+ @VisibleForTesting
static SerializedArray from(VariantMetadata metadata, ByteBuffer value, int
header) {
+ return from(metadata, value, header, 0);
+ }
+
+ static SerializedArray from(VariantMetadata metadata, ByteBuffer value, int
header, int depth) {
Preconditions.checkArgument(
value.order() == ByteOrder.LITTLE_ENDIAN, "Unsupported byte order: big
endian");
BasicType basicType = VariantUtil.basicType(header);
Preconditions.checkArgument(
basicType == BasicType.ARRAY, "Invalid array, basic type: " +
basicType);
- return new SerializedArray(metadata, value, header);
+ return new SerializedArray(metadata, value, header, depth);
}
private final VariantMetadata metadata;
private final ByteBuffer value;
private final int offsetSize;
private final int offsetListOffset;
private final int dataOffset;
+ private final int depth;
private final VariantValue[] array;
- private SerializedArray(VariantMetadata metadata, ByteBuffer value, int
header) {
+ private SerializedArray(VariantMetadata metadata, ByteBuffer value, int
header, int depth) {
this.metadata = metadata;
this.value = value;
+ this.depth = depth;
this.offsetSize = 1 + ((header & OFFSET_SIZE_MASK) >> OFFSET_SIZE_SHIFT);
int numElementsSize = ((header & IS_LARGE) == IS_LARGE) ? 4 : 1;
+ Preconditions.checkArgument(
+ value.remaining() >= HEADER_SIZE + numElementsSize,
+ "Invalid variant array: buffer too small for element count field");
int numElements = ByteBuffers.readLittleEndianUnsigned(value, HEADER_SIZE,
numElementsSize);
+ Preconditions.checkArgument(
+ numElements >= 0, "Invalid variant array: negative element count %s",
numElements);
this.offsetListOffset = HEADER_SIZE + numElementsSize;
+ long offsetTableEnd = (long) offsetListOffset + ((long) numElements + 1L)
* offsetSize;
+ Preconditions.checkArgument(
+ offsetTableEnd <= value.remaining(),
+ "Invalid variant array: element count %s exceeds buffer",
+ numElements);
this.dataOffset = offsetListOffset + ((1 + numElements) * offsetSize);
Review Comment:
`dataOffset` here is assigned with plain-`int` arithmetic, but we just
computed the same bound in `long` as `offsetTableEnd` and guarded against it.
For `offsetSize = 4`, `(1 + numElements) * offsetSize` overflows `int`
around numElements ~536M — a valid 4-byte count. The `long` guard above still
passes, so `dataOffset` wraps negative and we hit a `ByteBuffer` underflow at
access time instead of the `IllegalArgumentException` this check is meant to
produce.
I'd just reuse the value we already have: `this.dataOffset =
Math.toIntExact(offsetTableEnd)`. wdyt?
##########
api/src/main/java/org/apache/iceberg/variants/VariantUtil.java:
##########
@@ -30,8 +32,38 @@ class VariantUtil {
private static final int BASIC_TYPE_OBJECT = 2;
private static final int BASIC_TYPE_ARRAY = 3;
+ /**
+ * Maximum permitted nesting depth of a Variant value. The top-level value
is depth 0, so a
+ * Variant may contain up to {@code MAX_VARIANT_DEPTH} nested levels.
+ */
+ static final int MAX_VARIANT_DEPTH = 500;
+
Review Comment:
This is a non-spec limit — the Parquet VariantEncoding spec doesn't cap
nesting depth and Iceberg defers to it. A valid variant with >500 levels
written by Spark or another client would read fine elsewhere but throw here at
scan time, which reads as silent data loss rather than malformed-input
rejection.
What did parquet-java#3562 settle on? If it picked a different value (or no
cap), we'd diverge on file interop. I'd at minimum cross-reference that
decision in a comment here and confirm 500 is high enough to never reject real
data (it's below Jackson's default of 2000).
##########
api/src/main/java/org/apache/iceberg/variants/SerializedObject.java:
##########
@@ -60,18 +64,33 @@ static SerializedObject from(VariantMetadata metadata,
ByteBuffer value, int hea
private final int[] offsets;
private final int[] lengths;
private final int dataOffset;
+ private final int depth;
private final VariantValue[] values;
- private SerializedObject(VariantMetadata metadata, ByteBuffer value, int
header) {
+ private SerializedObject(VariantMetadata metadata, ByteBuffer value, int
header, int depth) {
this.metadata = metadata;
this.value = value;
+ this.depth = depth;
this.offsetSize = 1 + ((header & OFFSET_SIZE_MASK) >> OFFSET_SIZE_SHIFT);
this.fieldIdSize = 1 + ((header & FIELD_ID_SIZE_MASK) >>
FIELD_ID_SIZE_SHIFT);
int numElementsSize = ((header & IS_LARGE) == IS_LARGE) ? 4 : 1;
+ Preconditions.checkArgument(
+ value.remaining() >= HEADER_SIZE + numElementsSize,
+ "Invalid variant object: buffer too small for element count field");
int numElements = ByteBuffers.readLittleEndianUnsigned(value, HEADER_SIZE,
numElementsSize);
+ Preconditions.checkArgument(
+ numElements >= 0, "Invalid variant object: negative element count %s",
numElements);
this.fieldIdListOffset = HEADER_SIZE + numElementsSize;
- this.fieldIds = new Integer[numElements];
+ long dataStart =
+ (long) fieldIdListOffset
+ + (long) numElements * fieldIdSize
+ + ((long) numElements + 1L) * offsetSize;
+ Preconditions.checkArgument(
+ dataStart <= value.remaining(),
Review Comment:
Same int-overflow as the array side: `offsetListOffset` and `dataOffset`
below are assigned in plain `int` while `dataStart` here is `long`.
`numElements * fieldIdSize` and `(1 + numElements) * offsetSize` can overflow
before the `long` guard's bound is ever applied to the int fields, wrapping the
data offset negative for a large-but-valid count.
I'd derive both offsets from the `long` intermediates and `Math.toIntExact`
them so the guard actually protects the value we use.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]