vinothchandar commented on code in PR #18146:
URL: https://github.com/apache/hudi/pull/18146#discussion_r2828112934
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -1660,6 +1919,81 @@ public void validate(Schema schema) {
}
}
+ static class VectorLogicalType extends LogicalType {
+ private static final String VECTOR_LOGICAL_TYPE_NAME = "vector";
+ private static final String PROP_DIMENSION = "dimension";
+ private static final String PROP_ELEMENT_TYPE = "elementType";
+ private static final String PROP_STORAGE_BACKING = "storageBacking";
+
+ private final int dimension;
+ private final String elementType;
+ private final String storageBacking;
+
+ public VectorLogicalType(int dimension, String elementType, String
storageBacking) {
+ super(VectorLogicalType.VECTOR_LOGICAL_TYPE_NAME);
+ ValidationUtils.checkArgument(dimension > 0,
+ () -> "Vector dimension must be positive: " + dimension);
+ ValidationUtils.checkArgument(elementType != null &&
!elementType.isEmpty(),
+ () -> "Element type cannot be null or empty");
+ ValidationUtils.checkArgument(storageBacking != null &&
!storageBacking.isEmpty(),
+ () -> "Storage backing cannot be null or empty");
+
+ this.dimension = dimension;
+ this.elementType = elementType;
+ this.storageBacking = storageBacking;
+ }
+
+ public int getDimension() {
+ return dimension;
+ }
+
+ public String getElementType() {
+ return elementType;
+ }
+
+ public String getStorageBacking() {
+ return storageBacking;
+ }
+
+ @Override
+ public Schema addToSchema(Schema schema) {
+ super.addToSchema(schema);
+ schema.addProp(PROP_DIMENSION, dimension);
+ schema.addProp(PROP_ELEMENT_TYPE, elementType);
+ schema.addProp(PROP_STORAGE_BACKING, storageBacking);
+ return schema;
+ }
Review Comment:
⚠️ **IMPORTANT: `VectorLogicalType` missing `validate()` override**
`VariantLogicalType` overrides `validate()` to ensure it's only applied to
RECORD schemas. `VectorLogicalType` does not override `validate()` — meaning
the logical type could be applied to any Avro schema type (STRING, INT, etc.)
without error. It should validate that the underlying schema is FIXED.
**Suggestion:** Add a `validate()` override:
```java
@Override
public void validate(Schema schema) {
super.validate(schema);
if (schema.getType() != Schema.Type.FIXED) {
throw new IllegalArgumentException(
"Vector logical type can only be applied to FIXED schemas, got: " +
schema.getType());
}
}
```
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -1660,6 +1919,81 @@ public void validate(Schema schema) {
}
}
+ static class VectorLogicalType extends LogicalType {
+ private static final String VECTOR_LOGICAL_TYPE_NAME = "vector";
+ private static final String PROP_DIMENSION = "dimension";
+ private static final String PROP_ELEMENT_TYPE = "elementType";
+ private static final String PROP_STORAGE_BACKING = "storageBacking";
+
+ private final int dimension;
+ private final String elementType;
+ private final String storageBacking;
+
+ public VectorLogicalType(int dimension, String elementType, String
storageBacking) {
+ super(VectorLogicalType.VECTOR_LOGICAL_TYPE_NAME);
+ ValidationUtils.checkArgument(dimension > 0,
+ () -> "Vector dimension must be positive: " + dimension);
+ ValidationUtils.checkArgument(elementType != null &&
!elementType.isEmpty(),
+ () -> "Element type cannot be null or empty");
+ ValidationUtils.checkArgument(storageBacking != null &&
!storageBacking.isEmpty(),
+ () -> "Storage backing cannot be null or empty");
+
+ this.dimension = dimension;
+ this.elementType = elementType;
+ this.storageBacking = storageBacking;
+ }
+
+ public int getDimension() {
+ return dimension;
+ }
+
+ public String getElementType() {
+ return elementType;
+ }
+
+ public String getStorageBacking() {
+ return storageBacking;
+ }
+
+ @Override
+ public Schema addToSchema(Schema schema) {
+ super.addToSchema(schema);
+ schema.addProp(PROP_DIMENSION, dimension);
+ schema.addProp(PROP_ELEMENT_TYPE, elementType);
+ schema.addProp(PROP_STORAGE_BACKING, storageBacking);
+ return schema;
+ }
+ }
+
+ /**
+ * Factory for creating VectorLogicalType instances.
+ */
+ private static class VectorLogicalTypeFactory implements
LogicalTypes.LogicalTypeFactory {
+ @Override
+ public LogicalType fromSchema(Schema schema) {
+ // Extract properties from schema
+ Object dimObj = schema.getObjectProp(VectorLogicalType.PROP_DIMENSION);
+ int dimension = dimObj instanceof Number ? ((Number) dimObj).intValue()
: 0;
Review Comment:
💬 **SUGGESTION: Factory silently defaults dimension to 0 when missing**
If `dimension` is not a Number in the schema props, it defaults to 0, then
`VectorLogicalType` constructor throws `"Vector dimension must be positive:
0"`. The error is misleading — it suggests dimension 0 was explicitly set, when
the property was actually missing/wrong type.
**Suggestion:**
```java
if (dimObj == null || !(dimObj instanceof Number)) {
throw new IllegalArgumentException("Vector schema missing required
'dimension' property");
}
int dimension = ((Number) dimObj).intValue();
```
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -1492,6 +1541,216 @@ public int hashCode() {
}
}
+ public static class Vector extends HoodieSchema {
+ private static final String DEFAULT_NAME = "vector";
+
+ /**
+ * Enum representing vector element data types.
+ */
+ public enum VectorElementType {
+ FLOAT("FLOAT", 4),
+ DOUBLE("DOUBLE", 8),
+ INT8("INT8", 1);
+
+ private final String dataType;
+ private final int elementSize;
+
+ VectorElementType(String dataType, int elementSize) {
+ this.dataType = dataType;
+ this.elementSize = elementSize;
+ }
+
+ /**
+ * Returns the string representation for serialization.
+ *
+ * @return element type name
+ */
+ public String getDataType() {
+ return dataType;
+ }
+
+ /**
+ * Returns the byte size of a single element.
+ *
+ * @return number of bytes per element
+ */
+ public int getElementSize() {
+ return elementSize;
+ }
+
+ /**
+ * Converts a string to VectorElementType enum.
+ *
+ * @param name the element type name (e.g., "FLOAT", "DOUBLE", "INT8")
+ * @return the corresponding enum value
+ * @throws IllegalArgumentException if name is unknown
+ */
+ public static VectorElementType fromString(String name) {
+ for (VectorElementType type : values()) {
+ if (type.dataType.equalsIgnoreCase(name)) {
+ return type;
+ }
+ }
+ throw new IllegalArgumentException("Unknown element type: " + name);
+ }
+ }
+
+ // Storage backing types
+ public static final String STORAGE_BACKING_FIXED_BYTES = "FIXED_BYTES";
+
+ private final int dimension;
+ private final VectorElementType elementType;
+ private final String storageBacking;
+
+ /**
+ * Creates Vector from pre-built schema (used by factory methods).
+ *
+ * @param avroSchema the Avro schema to wrap, must be a valid Vector schema
+ * @throws IllegalArgumentException if avroSchema is null or not a valid
Vector schema
+ */
+ private Vector(Schema avroSchema) {
+ super(avroSchema);
+
+ // Extract properties from LogicalType
+ LogicalType logicalType = avroSchema.getLogicalType();
+ if (!(logicalType instanceof VectorLogicalType)) {
+ throw new IllegalArgumentException(
+ "Schema must have VectorLogicalType, got: " + logicalType);
+ }
+
+ VectorLogicalType vectorLogicalType = (VectorLogicalType) logicalType;
+ this.dimension = vectorLogicalType.getDimension();
+ this.elementType =
VectorElementType.fromString(vectorLogicalType.getElementType());
+ this.storageBacking = vectorLogicalType.getStorageBacking();
+
+ // Validate schema structure
+ validateVectorSchema(avroSchema);
+ }
+
+ @Override
+ public String getName() {
+ return "vector";
+ }
+
+ @Override
+ public HoodieSchemaType getType() {
+ return HoodieSchemaType.VECTOR;
+ }
+
+ /**
+ * Gets the byte size of a single element based on element type.
+ *
+ * @param elementType the element type
+ * @return number of bytes per element
+ */
+ private static int getElementSize(VectorElementType elementType) {
+ return elementType.getElementSize();
+ }
+
+ /**
+ * Creates vector schema with specified dimension and element type.
+ *
+ * @param name fixed type name (not null)
+ * @param dimension vector dimension (must be > 0)
Review Comment:
💅 **NIT: Unnecessary static method `getElementSize(VectorElementType)`**
This method just delegates to `elementType.getElementSize()`. Called in two
places where the enum method could be invoked directly.
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -1492,6 +1541,216 @@ public int hashCode() {
}
}
+ public static class Vector extends HoodieSchema {
+ private static final String DEFAULT_NAME = "vector";
+
+ /**
+ * Enum representing vector element data types.
+ */
+ public enum VectorElementType {
+ FLOAT("FLOAT", 4),
+ DOUBLE("DOUBLE", 8),
+ INT8("INT8", 1);
+
+ private final String dataType;
+ private final int elementSize;
+
+ VectorElementType(String dataType, int elementSize) {
+ this.dataType = dataType;
+ this.elementSize = elementSize;
+ }
+
+ /**
+ * Returns the string representation for serialization.
+ *
+ * @return element type name
+ */
+ public String getDataType() {
+ return dataType;
+ }
+
+ /**
+ * Returns the byte size of a single element.
+ *
+ * @return number of bytes per element
+ */
+ public int getElementSize() {
+ return elementSize;
+ }
+
+ /**
+ * Converts a string to VectorElementType enum.
+ *
+ * @param name the element type name (e.g., "FLOAT", "DOUBLE", "INT8")
+ * @return the corresponding enum value
+ * @throws IllegalArgumentException if name is unknown
+ */
+ public static VectorElementType fromString(String name) {
+ for (VectorElementType type : values()) {
+ if (type.dataType.equalsIgnoreCase(name)) {
+ return type;
+ }
+ }
+ throw new IllegalArgumentException("Unknown element type: " + name);
+ }
+ }
+
+ // Storage backing types
+ public static final String STORAGE_BACKING_FIXED_BYTES = "FIXED_BYTES";
+
+ private final int dimension;
+ private final VectorElementType elementType;
+ private final String storageBacking;
+
+ /**
+ * Creates Vector from pre-built schema (used by factory methods).
+ *
+ * @param avroSchema the Avro schema to wrap, must be a valid Vector schema
+ * @throws IllegalArgumentException if avroSchema is null or not a valid
Vector schema
+ */
+ private Vector(Schema avroSchema) {
+ super(avroSchema);
+
+ // Extract properties from LogicalType
+ LogicalType logicalType = avroSchema.getLogicalType();
+ if (!(logicalType instanceof VectorLogicalType)) {
+ throw new IllegalArgumentException(
+ "Schema must have VectorLogicalType, got: " + logicalType);
+ }
+
+ VectorLogicalType vectorLogicalType = (VectorLogicalType) logicalType;
+ this.dimension = vectorLogicalType.getDimension();
+ this.elementType =
VectorElementType.fromString(vectorLogicalType.getElementType());
+ this.storageBacking = vectorLogicalType.getStorageBacking();
+
+ // Validate schema structure
+ validateVectorSchema(avroSchema);
+ }
+
+ @Override
+ public String getName() {
+ return "vector";
+ }
+
+ @Override
+ public HoodieSchemaType getType() {
+ return HoodieSchemaType.VECTOR;
+ }
+
+ /**
+ * Gets the byte size of a single element based on element type.
+ *
+ * @param elementType the element type
Review Comment:
💅 **NIT: **
`Vector.getName()` override is redundant? The base class
`HoodieSchema.getName()` already returns `type.name().toLowerCase()` =
`"vector"` when a logical type is present (line 710-711). This override returns
the same value. Should we think for this type (and also probably others)
whether `return getAvroSchema().getName()` to honor custom names passed via
`createVector(name, ...)`.
##########
hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java:
##########
@@ -838,6 +839,261 @@ void testCreateDecimalSchema() {
assertEquals(5, decimalFixedSchema.getFixedSize());
}
+ @Test
+ void testCreateVectorWithDimension() {
+ // Create vector with dimension only (defaults to FLOAT)
+ HoodieSchema schema = HoodieSchema.createVector(1536);
+
+ HoodieSchema.Vector vectorSchema = assertVector(schema, 1536,
HoodieSchema.Vector.VectorElementType.FLOAT);
+ assertEquals(HoodieSchemaType.VECTOR, schema.getType());
+
+ assertTrue(schema.getAvroSchema().getLogicalType() instanceof
VectorLogicalType);
+
+ // Verify properties are at schema level
+ assertVectorAvroProperties(vectorSchema, 1536,
HoodieSchema.Vector.VectorElementType.FLOAT);
+
+ // Verify Vector is FIXED type (not RECORD)
+ Schema avroSchema = vectorSchema.getAvroSchema();
+ assertEquals(Schema.Type.FIXED, avroSchema.getType());
+ assertFalse(vectorSchema.hasFields());
+
+ // Verify FIXED size = dimension × elementSize (1536 × 4 bytes for FLOAT)
+ assertEquals(1536 * 4, avroSchema.getFixedSize());
+ }
+
+ @Test
+ void testCreateVectorWithNameAndDimension() {
+ // Create vector with custom name and dimension
+ HoodieSchema schema = HoodieSchema.createVector("embeddings", 768);
+ HoodieSchema.Vector vectorSchema = assertVector(schema, 768,
HoodieSchema.Vector.VectorElementType.FLOAT);
+ assertEquals(HoodieSchemaType.VECTOR, schema.getType());
+ assertEquals("embeddings", vectorSchema.getAvroSchema().getName());
+ }
+
+ @Test
+ void testCreateVectorWithDimensionAndElementType() {
+ // Create vector with DOUBLE element type
+ HoodieSchema schemaDouble = HoodieSchema.createVector(1536,
HoodieSchema.Vector.VectorElementType.DOUBLE);
Review Comment:
💬 **SUGGESTION: Missing test for INT8 element type**
Tests cover FLOAT and DOUBLE but never exercise `VectorElementType.INT8`.
Since INT8 is in the enum (1 byte per element), add at least one test to verify
correct FIXED size calculation.
**Suggestion:**
```java
HoodieSchema.Vector vectorInt8 = HoodieSchema.createVector(256,
HoodieSchema.Vector.VectorElementType.INT8);
assertEquals(256, vectorInt8.getFixedSize()); // 256 × 1 byte
assertEquals(HoodieSchema.Vector.VectorElementType.INT8,
vectorInt8.getVectorElementType());
```
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -632,6 +635,52 @@ public static HoodieSchema.Variant
createVariantShredded(String name, String nam
return new HoodieSchema.Variant(recordSchema);
}
+ /**
+ * Creates Vector schema with default name and specified dimension.
+ *
+ * @param dimension vector dimension (must be > 0)
+ * @return new HoodieSchema.Vector
+ */
+ public static HoodieSchema.Vector createVector(int dimension) {
+ return createVector(null, dimension);
+ }
+
+ /**
+ * Creates Vector schema with custom name and dimension.
+ *
+ * @param name record name (null uses default "vector")
+ * @param dimension vector dimension (must be > 0)
+ * @return new HoodieSchema.Vector
+ */
+ public static HoodieSchema.Vector createVector(String name, int dimension) {
Review Comment:
💅 **NIT** rename createFloatVector() ? (assuming we anticipate, vectors
of couple different types)
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -1492,6 +1541,216 @@ public int hashCode() {
}
}
+ public static class Vector extends HoodieSchema {
+ private static final String DEFAULT_NAME = "vector";
+
+ /**
+ * Enum representing vector element data types.
+ */
+ public enum VectorElementType {
+ FLOAT("FLOAT", 4),
+ DOUBLE("DOUBLE", 8),
+ INT8("INT8", 1);
+
+ private final String dataType;
+ private final int elementSize;
+
+ VectorElementType(String dataType, int elementSize) {
+ this.dataType = dataType;
+ this.elementSize = elementSize;
+ }
+
+ /**
+ * Returns the string representation for serialization.
+ *
+ * @return element type name
+ */
+ public String getDataType() {
+ return dataType;
+ }
+
+ /**
+ * Returns the byte size of a single element.
+ *
+ * @return number of bytes per element
+ */
+ public int getElementSize() {
+ return elementSize;
+ }
+
+ /**
+ * Converts a string to VectorElementType enum.
+ *
+ * @param name the element type name (e.g., "FLOAT", "DOUBLE", "INT8")
+ * @return the corresponding enum value
+ * @throws IllegalArgumentException if name is unknown
+ */
+ public static VectorElementType fromString(String name) {
+ for (VectorElementType type : values()) {
+ if (type.dataType.equalsIgnoreCase(name)) {
+ return type;
+ }
+ }
+ throw new IllegalArgumentException("Unknown element type: " + name);
+ }
+ }
+
+ // Storage backing types
+ public static final String STORAGE_BACKING_FIXED_BYTES = "FIXED_BYTES";
Review Comment:
turn this into an enum as well .. ? 1 element for now
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -632,6 +635,52 @@ public static HoodieSchema.Variant
createVariantShredded(String name, String nam
return new HoodieSchema.Variant(recordSchema);
}
+ /**
+ * Creates Vector schema with default name and specified dimension.
+ *
+ * @param dimension vector dimension (must be > 0)
+ * @return new HoodieSchema.Vector
+ */
+ public static HoodieSchema.Vector createVector(int dimension) {
+ return createVector(null, dimension);
Review Comment:
null for name. no side effects?
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -1660,6 +1919,81 @@ public void validate(Schema schema) {
}
}
+ static class VectorLogicalType extends LogicalType {
+ private static final String VECTOR_LOGICAL_TYPE_NAME = "vector";
+ private static final String PROP_DIMENSION = "dimension";
+ private static final String PROP_ELEMENT_TYPE = "elementType";
+ private static final String PROP_STORAGE_BACKING = "storageBacking";
+
+ private final int dimension;
+ private final String elementType;
+ private final String storageBacking;
+
+ public VectorLogicalType(int dimension, String elementType, String
storageBacking) {
+ super(VectorLogicalType.VECTOR_LOGICAL_TYPE_NAME);
+ ValidationUtils.checkArgument(dimension > 0,
+ () -> "Vector dimension must be positive: " + dimension);
+ ValidationUtils.checkArgument(elementType != null &&
!elementType.isEmpty(),
+ () -> "Element type cannot be null or empty");
+ ValidationUtils.checkArgument(storageBacking != null &&
!storageBacking.isEmpty(),
+ () -> "Storage backing cannot be null or empty");
+
+ this.dimension = dimension;
+ this.elementType = elementType;
+ this.storageBacking = storageBacking;
+ }
+
+ public int getDimension() {
+ return dimension;
+ }
+
+ public String getElementType() {
+ return elementType;
+ }
+
+ public String getStorageBacking() {
+ return storageBacking;
+ }
+
+ @Override
+ public Schema addToSchema(Schema schema) {
+ super.addToSchema(schema);
+ schema.addProp(PROP_DIMENSION, dimension);
+ schema.addProp(PROP_ELEMENT_TYPE, elementType);
+ schema.addProp(PROP_STORAGE_BACKING, storageBacking);
+ return schema;
+ }
+ }
+
+ /**
+ * Factory for creating VectorLogicalType instances.
+ */
+ private static class VectorLogicalTypeFactory implements
LogicalTypes.LogicalTypeFactory {
+ @Override
+ public LogicalType fromSchema(Schema schema) {
+ // Extract properties from schema
+ Object dimObj = schema.getObjectProp(VectorLogicalType.PROP_DIMENSION);
+ int dimension = dimObj instanceof Number ? ((Number) dimObj).intValue()
: 0;
Review Comment:
Should we disallow vectors of dimension 0. and throw an error?
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -632,6 +635,52 @@ public static HoodieSchema.Variant
createVariantShredded(String name, String nam
return new HoodieSchema.Variant(recordSchema);
}
+ /**
+ * Creates Vector schema with default name and specified dimension.
+ *
+ * @param dimension vector dimension (must be > 0)
+ * @return new HoodieSchema.Vector
+ */
+ public static HoodieSchema.Vector createVector(int dimension) {
+ return createVector(null, dimension);
+ }
+
+ /**
+ * Creates Vector schema with custom name and dimension.
+ *
+ * @param name record name (null uses default "vector")
+ * @param dimension vector dimension (must be > 0)
+ * @return new HoodieSchema.Vector
+ */
+ public static HoodieSchema.Vector createVector(String name, int dimension) {
+ return createVector(name, dimension, Vector.VectorElementType.FLOAT);
+ }
+
+ /**
+ * Creates Vector schema with custom dimension and element type.
+ *
+ * @param dimension vector dimension (must be > 0)
+ * @param elementType element type (use {@link
Vector.VectorElementType#FLOAT} or {@link Vector.VectorElementType#DOUBLE})
+ * @return new HoodieSchema.Vector
+ */
+ public static HoodieSchema.Vector createVector(int dimension,
Vector.VectorElementType elementType) {
+ return createVector(null, dimension, elementType);
+ }
+
+ /**
+ * Creates Vector schema with custom name, dimension, and element type.
+ *
+ * @param name record name (null uses default "vector")
+ * @param dimension vector dimension (must be > 0)
+ * @param elementType element type (use {@link
Vector.VectorElementType#FLOAT} or {@link Vector.VectorElementType#DOUBLE})
+ * @return new HoodieSchema.Vector
+ */
+ public static HoodieSchema.Vector createVector(String name, int dimension,
Vector.VectorElementType elementType) {
+ String vectorName = (name != null && !name.isEmpty()) ? name :
Vector.DEFAULT_NAME;
Review Comment:
pass in the DEFAULT_NAME instead of teh `null` indirection?>
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchemaType.java:
##########
@@ -119,6 +119,8 @@ public enum HoodieSchemaType {
VARIANT(Schema.Type.RECORD),
+ VECTOR(Schema.Type.FIXED),
Review Comment:
⚠️ **IMPORTANT**
thinking out loud.. the other two pieces of information, elementType and
dimension are same across all records, so storing them here is pure overhead..
Should we define the vector to be an RECORD with a `vectors` field under,
for future flexibility? i.e say we decide to encode extra pieces of info for
sparse vectors? Do you anticipate such evolutions that will then make `FIXED`
inflexible to evolve to add new fields..
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]