Re: [PR] feat(flink): Support writing VECTOR columns for flink writer [hudi]

via GitHub Tue, 02 Jun 2026 17:55:45 -0700


hudi-agent commented on code in PR #18877:
URL: https://github.com/apache/hudi/pull/18877#discussion_r3345317570



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##########
@@ -678,7 +678,8 @@ private MergeOnReadInputFormat mergeOnReadInputFormat(
         this.limit == NO_LIMIT_CONSTANT ? Long.MAX_VALUE : this.limit, // 
ParquetInputFormat always uses the limit value
         getParquetConf(this.conf, this.hadoopConf.unwrap()),
         this.conf.get(FlinkOptions.READ_UTC_TIMEZONE),
-        this.internalSchemaManager
+        this.internalSchemaManager,
+        tableSchema
     );
   }
 

Review Comment:
   🤖 [Line 687] Should `inferSchemaFromDdl()` also forward 
`conf.get(FlinkOptions.VECTOR_COLUMNS)` to the converter? Today it calls the 
1-arg `convertToSchema(this.tableRowType)`, so when a stream reader starts 
against an empty table created with `hoodie.vector.columns=...`, the inferred 
schema loses the VECTOR metadata and subsequent reads of written files may fail 
to decode the FIXED_LEN_BYTE_ARRAY payload.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringOperator.java:
##########
@@ -169,6 +169,7 @@ public void open() throws Exception {
     this.table = writeClient.getHoodieTable();
 
     this.schema = HoodieSchemaConverter.convertToSchema(rowType);
+    this.writeConfig.setSchema(schema.toString());

Review Comment:
   🤖 This override drops VECTOR metadata from `writeConfig.schema`. `rowType` 
here is ultimately built from 
`HoodieSchemaConverter.convertToDataType(tableSchema)` (see 
`HoodieFlinkClusteringJob#cluster`), which maps VECTOR fields back into 
`ARRAY<FLOAT/DOUBLE/TINYINT>`. After `convertToSchema(rowType)` the schema 
stored here is a plain ARRAY schema. Since 
`HoodieRowDataCreateHandle#createNewFileWriter` now reads the writer schema 
from `config.getSchema()`, clustering would then rewrite VECTOR columns as 
regular LIST parquet types and drop the `hoodie.vector.columns` footer key. 
Could you verify with a clustering round-trip on a VECTOR table — e.g., derive 
the writer schema from `tableSchema` (with metadata fields added) instead of 
from `rowType`?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-flink-datasource/hudi-flink1.17.x/src/main/java/org/apache/hudi/table/format/cow/ParquetSplitReaderUtil.java:
##########
@@ -365,7 +365,15 @@ private static ColumnReader createColumnReader(
       case VARCHAR:
       case BINARY:
       case VARBINARY:
-        return new BytesColumnReader(descriptor, pageReader);
+        switch (descriptor.getPrimitiveType().getPrimitiveTypeName()) {
+          case BINARY:
+            return new BytesColumnReader(descriptor, pageReader);
+          case FIXED_LEN_BYTE_ARRAY:
+            return new FixedLenBytesColumnReader(
+                descriptor, pageReader);
+          default:
+            throw new AssertionError();

Review Comment:
   🤖 Minor: the new `default: throw new AssertionError();` for the 
CHAR/VARCHAR/BINARY/VARBINARY switch has no message. If a parquet file ever 
shows up with, say, INT32/INT64 for one of these logical types (uncommon, but 
the prior `BytesColumnReader` would at least surface a typed error), debugging 
this AssertionError would be hard. Could you include the primitive type name in 
the message — same pattern as the existing 
`getPrimitiveTypeCheckFailureMessage` used in `createWritableColumnVector`? 
Also applies to the matching changes in 1.18.x, 1.19.x, 1.20.x, 2.0.x, and 
2.1.x.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/parquet/ParquetSchemaConverter.java:
##########
@@ -367,14 +402,24 @@ private static Type convertToParquetType(
                 Types
                     .repeatedGroup()
                     .addField(convertToParquetType("key", keyType, 
Type.Repetition.REQUIRED))
-                    .addField(convertToParquetType("value", valueType, 
valueType.isNullable() ? Type.Repetition.OPTIONAL : Type.Repetition.REQUIRED))
+                    .addField(convertToParquetType(
+                        "value",

Review Comment:
   🤖 nit: it might be worth adding an error message to this `checkArgument`, 
the same way the analogous check in `ParquetRowDataWriter` does (`"Hoodie 
schema should be RECORD type."`). Without a message, the resulting 
`IllegalArgumentException` gives no hint about what went wrong.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriteSupport.java:
##########
@@ -52,6 +52,11 @@ public WriteSupport.FinalizedWriteContext finalizeWrite() {
     Map<String, String> extraMetadata =
         
bloomFilterWriteSupportOpt.map(HoodieBloomFilterWriteSupport::finalizeMetadata)
             .orElse(Collections.emptyMap());
+    String vectorColumnsMetadata = 
HoodieSchema.buildVectorColumnsMetadataValue(hoodieSchema);
+    if (!vectorColumnsMetadata.isEmpty()) {
+      extraMetadata = new java.util.HashMap<>(extraMetadata);

Review Comment:
   🤖 nit: could you add an import for `HashMap` rather than using the 
fully-qualified `java.util.HashMap<>` inline? Fully-qualified names in method 
bodies are pretty unusual and stand out to readers.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/HoodieSchemaConverter.java:
##########
@@ -238,6 +280,97 @@ public static HoodieSchema convertToSchema(LogicalType 
logicalType, String rowNa
     return nullable ? HoodieSchema.createNullable(schema) : schema;
   }
 
+  private static Map<String, Integer> parseVectorColumns(String vectorColumns) 
{
+    Map<String, Integer> parsed = new LinkedHashMap<>();
+    for (String rawEntry : vectorColumns.split(",")) {
+      String entry = rawEntry.trim();
+      if (entry.isEmpty()) {
+        continue;
+      }
+      String[] parts = entry.split(":", -1);
+      if (parts.length > 2 || parts[0].trim().isEmpty()) {
+        throw new IllegalArgumentException(
+            "Invalid VECTOR column descriptor '" + entry + "'. Expected 
format: columnName[:dimension].");
+      }
+      String columnName = parts[0].trim();
+      String normalizedColumnName = columnName.toLowerCase(Locale.ROOT);
+      if (parsed.containsKey(normalizedColumnName)) {
+        throw new IllegalArgumentException("Duplicate VECTOR column descriptor 
for column: " + columnName);
+      }
+      int dimension = DEFAULT_VECTOR_DIMENSION;
+      if (parts.length == 2) {
+        String dimensionText = parts[1].trim();
+        if (dimensionText.isEmpty()) {
+          throw new IllegalArgumentException(
+              "Invalid VECTOR column descriptor '" + entry + "'. Dimension 
must not be empty.");
+        }
+        try {
+          dimension = Integer.parseInt(dimensionText);
+        } catch (NumberFormatException e) {
+          throw new IllegalArgumentException("Invalid VECTOR dimension for 
column '" + columnName + "': " + dimensionText, e);
+        }
+      }
+      if (dimension <= 0) {
+        throw new IllegalArgumentException("VECTOR dimension must be positive 
for column '" + columnName + "': " + dimension);
+      }
+      parsed.put(normalizedColumnName, dimension);
+    }
+    return parsed;
+  }
+
+  private static HoodieSchema convertVectorField(String fieldName, LogicalType 
fieldType, int dimension) {
+    if (!(fieldType instanceof ArrayType)) {
+      throw new IllegalArgumentException(
+          "VECTOR column '" + fieldName + "' must be declared as ARRAY<FLOAT>, 
ARRAY<DOUBLE>, or ARRAY<TINYINT>, but got: "
+              + fieldType.asSummaryString());
+    }
+    HoodieSchema.Vector.VectorElementType elementType = 
inferVectorElementType(fieldName, ((ArrayType) fieldType).getElementType());
+    HoodieSchema vectorSchema = HoodieSchema.createVector(dimension, 
elementType);
+    return fieldType.isNullable() ? HoodieSchema.createNullable(vectorSchema) 
: vectorSchema;
+  }
+
+  private static HoodieSchema.Vector.VectorElementType 
inferVectorElementType(String fieldName, LogicalType elementType) {
+    switch (elementType.getTypeRoot()) {
+      case FLOAT:
+        return HoodieSchema.Vector.VectorElementType.FLOAT;
+      case DOUBLE:
+        return HoodieSchema.Vector.VectorElementType.DOUBLE;
+      case TINYINT:
+        return HoodieSchema.Vector.VectorElementType.INT8;
+      default:
+        throw new IllegalArgumentException(
+            "VECTOR column '" + fieldName + "' must use ARRAY<FLOAT>, 
ARRAY<DOUBLE>, or ARRAY<TINYINT>, but got ARRAY<"
+                + elementType.asSummaryString() + ">.");
+    }
+  }
+
+  /**
+   * Validates that all configured VECTOR column descriptors resolve to 
top-level fields.
+   *
+   * <p>Callers that accept {@code hoodie.vector.columns} as a table option 
should run this
+   * validation before schema inference so an unknown column is rejected 
instead of being
+   * silently ignored by schema conversion.
+   *
+   * @param logicalType      Flink logical type
+   * @param vectorColumnsStr comma-separated vector column descriptors, or 
null/empty
+   */
+  public static void validateVectorColumns(LogicalType logicalType, String 
vectorColumnsStr) {
+    if (vectorColumnsStr == null || vectorColumnsStr.trim().isEmpty()) {
+      return;
+    }
+    Map<String, Integer> vectorColumns = parseVectorColumns(vectorColumnsStr);
+    List<String> fieldNames = ((RowType) logicalType).getFieldNames();
+    List<String> normalizedFieldNames = fieldNames.stream()
+        .map(fieldName -> fieldName.toLowerCase(Locale.ROOT))
+        .collect(Collectors.toList());
+    vectorColumns.keySet().stream()

Review Comment:
   🤖 nit: throwing inside `ifPresent` is a bit unusual — have you considered 
extracting the `Optional` into a variable and using a plain `if 
(opt.isPresent()) throw ...`? It reads more naturally and makes the intent 
immediately clear.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/VectorConversionUtils.java:
##########
@@ -0,0 +1,338 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.util;
+
+import org.apache.hudi.common.schema.HoodieSchema;
+import org.apache.hudi.common.schema.HoodieSchemaType;
+import org.apache.hudi.common.util.HoodieVectorUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.common.util.collection.ClosableIterator;
+import org.apache.hudi.common.util.collection.CloseableMappingIterator;
+import org.apache.hudi.exception.SchemaCompatibilityException;
+
+import org.apache.flink.table.api.DataTypes;
+import org.apache.flink.table.data.ArrayData;
+import org.apache.flink.table.data.GenericArrayData;
+import org.apache.flink.table.data.GenericRowData;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.types.DataType;
+import org.apache.flink.table.types.logical.ArrayType;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.flink.table.types.logical.LogicalTypeRoot;
+import org.apache.flink.table.types.logical.RowType;
+
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Utilities for reading Hoodie VECTOR columns into Flink {@link RowData}.
+ *
+ * <p>VECTOR columns are stored as fixed-length bytes in parquet/log files. 
The Flink reader
+ * reads those physical bytes first and then decodes them into Flink array 
data according to
+ * the vector element type declared in {@link HoodieSchema.Vector}.
+ */
+public final class VectorConversionUtils {
+
+  private VectorConversionUtils() {
+  }
+
+  /**
+   * Detects VECTOR columns in the selected read projection.
+   *
+   * <p>The returned map is keyed by the projected field ordinal, not the 
ordinal in the full
+   * table schema. This matches the ordinal layout of the {@link RowData} 
emitted by the reader.
+   *
+   * @param fullFieldNames field names in the full query/table row type
+   * @param selectedFields ordinals selected from {@code fullFieldNames}
+   * @param tableSchema    hoodie table schema containing logical VECTOR type 
metadata
+   * @return projected ordinal to vector schema information
+   */
+  public static Map<Integer, HoodieSchema.Vector> detectVectorColumns(
+      String[] fullFieldNames,
+      int[] selectedFields,
+      HoodieSchema tableSchema) {
+    Map<String, HoodieSchema.Vector> vectorFields = 
getVectorFields(tableSchema);
+    Map<Integer, HoodieSchema.Vector> vectorColumnInfo = new LinkedHashMap<>();
+    for (int i = 0; i < selectedFields.length; i++) {
+      HoodieSchema.Vector vector = 
vectorFields.get(fullFieldNames[selectedFields[i]].toLowerCase(Locale.ROOT));
+      if (vector != null) {
+        vectorColumnInfo.put(i, vector);
+      }
+    }
+    return vectorColumnInfo;
+  }
+
+  /**
+   * Rewrites VECTOR fields in a requested row data type to BYTES for parquet 
reads.
+   *
+   * <p>Parquet stores VECTOR values as fixed-length byte arrays. This method 
keeps the requested
+   * row shape unchanged, but asks the physical parquet reader to materialize 
VECTOR columns as
+   * bytes. The resulting rows should be passed through
+   * {@link #wrapVectorColumnIterator(ClosableIterator, RowType, Map)} to 
decode the bytes into
+   * array data.
+   *
+   * @param dataType         requested row data type
+   * @param requestedSchema  requested Hudi schema
+   * @param vectorColumnInfo projected VECTOR column metadata
+   * @return row data type to use for the physical parquet read
+   */
+  public static DataType getParquetReadDataType(
+      DataType dataType,
+      HoodieSchema requestedSchema,
+      Map<Integer, HoodieSchema.Vector> vectorColumnInfo) {
+    if (vectorColumnInfo.isEmpty()) {
+      return dataType;
+    }
+
+    RowType rowType = (RowType) dataType.getLogicalType();
+    List<RowType.RowField> readFields = new 
ArrayList<>(rowType.getFields().size());
+    for (RowType.RowField field : rowType.getFields()) {
+      readFields.add(field.copy());
+    }
+    for (Integer requestedOrdinal : vectorColumnInfo.keySet()) {
+      String fieldName = 
requestedSchema.getFields().get(requestedOrdinal).name();
+      int fieldOrdinal = rowType.getFieldIndex(fieldName);
+      if (fieldOrdinal >= 0) {
+        RowType.RowField field = readFields.get(fieldOrdinal);
+        readFields.set(fieldOrdinal, newRowField(field, 
bytesType(field.getType())));
+      }
+    }
+    return DataTypes.of(new RowType(rowType.isNullable(), readFields));
+  }
+
+  /**
+   * Rewrites VECTOR field types in the full field type array to BYTES for 
parquet reads.
+   *
+   * <p>This variant is used by copy-on-write input format code paths that 
keep the full field
+   * type array and apply projection separately.
+   *
+   * @param fullFieldNames field names in the full query/table row type
+   * @param fullFieldTypes field types corresponding to {@code fullFieldNames}
+   * @param tableSchema    hoodie table schema containing logical VECTOR type 
metadata
+   * @return field types to use for the physical parquet read
+   */
+  public static DataType[] getParquetReadFieldTypes(
+      String[] fullFieldNames,
+      DataType[] fullFieldTypes,
+      HoodieSchema tableSchema) {
+    Map<String, HoodieSchema.Vector> vectorFields = 
getVectorFields(tableSchema);
+    if (vectorFields.isEmpty()) {
+      return fullFieldTypes;
+    }
+    DataType[] readFieldTypes = Arrays.copyOf(fullFieldTypes, 
fullFieldTypes.length);
+    for (int i = 0; i < fullFieldNames.length; i++) {
+      if 
(vectorFields.containsKey(fullFieldNames[i].toLowerCase(Locale.ROOT))) {
+        readFieldTypes[i] = 
DataTypes.of(bytesType(fullFieldTypes[i].getLogicalType()));
+      }
+    }
+    return readFieldTypes;
+  }
+
+  /**
+   * Wraps a row iterator and decodes VECTOR byte values into Flink array 
values.
+   *
+   * <p>{@code rowType} must describe the physical rows emitted by {@code 
rowDataItr}, where VECTOR
+   * columns have already been read as BYTES. Non-vector fields are copied 
with type-aware Flink
+   * field getters to preserve their original representation.
+   *
+   * @param rowDataItr       physical row iterator
+   * @param rowType          row type of the physical rows emitted by {@code 
rowDataItr}
+   * @param vectorColumnInfo projected VECTOR column metadata keyed by row 
ordinal
+   * @return iterator emitting rows with VECTOR columns decoded as arrays
+   */
+  public static ClosableIterator<RowData> wrapVectorColumnIterator(
+      ClosableIterator<RowData> rowDataItr,
+      RowType rowType,
+      Map<Integer, HoodieSchema.Vector> vectorColumnInfo) {
+    RowData.FieldGetter[] fieldGetters = createFieldGetters(rowType);
+    return new CloseableMappingIterator<>(
+        rowDataItr, rowData -> convertVectorColumns(rowData, fieldGetters, 
vectorColumnInfo));
+  }
+
+  /**
+   * Wraps a projected row iterator and decodes VECTOR byte values into Flink 
array values.
+   *
+   * <p>The selected physical row type is derived from {@code fullFieldTypes} 
and
+   * {@code selectedFields}, then delegated to {@link 
#wrapVectorColumnIterator(ClosableIterator, RowType, Map)}.
+   *
+   * @param rowDataItr       physical row iterator
+   * @param fullFieldTypes   field types before projection
+   * @param selectedFields   ordinals selected from {@code fullFieldTypes}
+   * @param vectorColumnInfo projected VECTOR column metadata keyed by row 
ordinal
+   * @return iterator emitting rows with VECTOR columns decoded as arrays
+   */
+  public static ClosableIterator<RowData> wrapVectorColumnIterator(
+      ClosableIterator<RowData> rowDataItr,
+      DataType[] fullFieldTypes,
+      int[] selectedFields,
+      Map<Integer, HoodieSchema.Vector> vectorColumnInfo) {
+    return wrapVectorColumnIterator(rowDataItr, createRowType(fullFieldTypes, 
selectedFields), vectorColumnInfo);
+  }
+
+  /**
+   * Returns VECTOR fields keyed by lower-cased field name.
+   */
+  private static Map<String, HoodieSchema.Vector> getVectorFields(HoodieSchema 
tableSchema) {
+    return tableSchema.getFields().stream()
+        .filter(field -> field.schema().getNonNullType().getType() == 
HoodieSchemaType.VECTOR)
+        .collect(Collectors.toMap(
+            field -> field.name().toLowerCase(Locale.ROOT),
+            field -> (HoodieSchema.Vector) field.schema().getNonNullType()));
+  }
+
+  /**
+   * Creates a BYTES logical type while preserving the original nullability.
+   */
+  private static LogicalType bytesType(LogicalType originalType) {
+    return DataTypes.BYTES().getLogicalType().copy(originalType.isNullable());
+  }
+
+  /**
+   * Creates a row field with a replacement logical type while preserving 
metadata.
+   */
+  private static RowType.RowField newRowField(RowType.RowField field, 
LogicalType type) {
+    return field.getDescription()
+        .map(description -> new RowType.RowField(field.getName(), type, 
description))
+        .orElseGet(() -> new RowType.RowField(field.getName(), type));
+  }
+
+  /**
+   * Converts VECTOR columns in one row from physical bytes to Flink array 
data.
+   */
+  private static RowData convertVectorColumns(
+      RowData rowData,
+      RowData.FieldGetter[] fieldGetters,
+      Map<Integer, HoodieSchema.Vector> vectorColumnInfo) {
+    GenericRowData converted = new GenericRowData(rowData.getArity());
+    converted.setRowKind(rowData.getRowKind());
+    for (int i = 0; i < rowData.getArity(); i++) {
+      if (rowData.isNullAt(i)) {
+        converted.setField(i, null);
+      } else if (vectorColumnInfo.containsKey(i)) {
+        converted.setField(i, createVectorArrayData(rowData.getBinary(i), 
vectorColumnInfo.get(i)));
+      } else {
+        converted.setField(i, fieldGetters[i].getFieldOrNull(rowData));
+      }
+    }
+    return converted;
+  }
+
+  /**
+   * Creates type-aware field getters for the physical row type.
+   */
+  private static RowData.FieldGetter[] createFieldGetters(RowType rowType) {
+    RowData.FieldGetter[] fieldGetters = new 
RowData.FieldGetter[rowType.getFieldCount()];
+    for (int i = 0; i < fieldGetters.length; i++) {
+      fieldGetters[i] = RowData.createFieldGetter(rowType.getTypeAt(i), i);
+    }
+    return fieldGetters;
+  }
+
+  /**
+   * Builds a projected row type from full field types and selected ordinals.
+   */
+  private static RowType createRowType(DataType[] fullFieldTypes, int[] 
selectedFields) {
+    List<RowType.RowField> rowFields = new ArrayList<>(selectedFields.length);
+    for (int i = 0; i < selectedFields.length; i++) {
+      rowFields.add(new RowType.RowField(String.valueOf(i), 
fullFieldTypes[selectedFields[i]].getLogicalType()));
+    }
+    return new RowType(rowFields);
+  }
+
+  /**
+   * Wraps a decoded primitive vector array as Flink array data.
+   */
+  public static GenericArrayData createVectorArrayData(byte[] bytes, 
HoodieSchema.Vector vectorSchema) {
+    Object vectorArray = HoodieVectorUtils.decodeVectorBytes(bytes, 
vectorSchema);
+    if (vectorArray instanceof float[]) {
+      return new GenericArrayData((float[]) vectorArray);
+    } else if (vectorArray instanceof double[]) {
+      return new GenericArrayData((double[]) vectorArray);
+    } else if (vectorArray instanceof byte[]) {
+      return new GenericArrayData((byte[]) vectorArray);
+    }
+    throw new UnsupportedOperationException("Unsupported decoded vector array 
type: " + vectorArray.getClass());
+  }
+
+  /**
+   * Encodes Flink array data into the canonical Hudi VECTOR fixed-bytes 
representation.
+   */
+  public static byte[] encodeVectorArrayData(ArrayData arrayData, 
HoodieSchema.Vector vectorSchema) {
+    int dimension = vectorSchema.getDimension();
+    HoodieSchema.Vector.VectorElementType elementType = 
vectorSchema.getVectorElementType();
+    ValidationUtils.checkArgument(arrayData.size() == dimension,
+        () -> "Vector dimension mismatch: schema expects " + dimension + " 
elements but got " + arrayData.size());
+    int bufferSize = Math.multiplyExact(dimension, 
elementType.getElementSize());
+    ByteBuffer buffer = 
ByteBuffer.allocate(bufferSize).order(HoodieSchema.VectorLogicalType.VECTOR_BYTE_ORDER);
+    switch (elementType) {
+      case FLOAT:
+        for (int i = 0; i < dimension; i++) {
+          buffer.putFloat(arrayData.getFloat(i));

Review Comment:
   🤖 `encodeVectorArrayData` doesn't check `arrayData.isNullAt(i)` before 
calling `getFloat/getDouble/getByte`. Since 
`HoodieSchemaConverter.convertVectorField` accepts `ARRAY<FLOAT>` with nullable 
elements (the default for Flink DDL `ARRAY<FLOAT>`), a row with a null element 
inside the vector would be silently encoded as `0.0` rather than failing or 
being detected. Worth either rejecting nullable element types in 
`inferVectorElementType` or guarding here with an explicit null check that 
throws.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(flink): Support writing VECTOR columns for flink writer [hudi]

Reply via email to