hudi-agent commented on code in PR #18599:
URL: https://github.com/apache/hudi/pull/18599#discussion_r3210593217
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java:
##########
@@ -293,6 +305,54 @@ private static Field rewriteBlobDataChild(Field blobStructField) {
     return new Field(blobStructField.getName(), blobStructField.getFieldType(), rebuilt);
   }
+  /**
+   * Single-pass walk that returns (a) the enriched schema with top-level
+   * {@code VariantType} fields replaced by Hudi's canonical
+   * {@code Struct[metadata: binary, value: binary]} (tagged {@code hudi_type=VARIANT}
+   * so {@code HoodieSparkSchemaConverters} promotes it back on read), and (b) the
+   * variant ordinals in ascending order. {@code LanceArrowUtils} has no VariantType
+   * case, so we hand it a plain struct. Top-level only - nested variants are not
+   * yet supported.
+   */
+  private static Pair<StructType, int[]> enrichForLanceVariant(StructType sparkSchema) {
+    StructField[] fields = sparkSchema.fields();
+    StructField[] newFields = null;
+    List<Integer> ordinals = null;
+    for (int i = 0; i < fields.length; i++) {
+      StructField field = fields[i];
+      if (!SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantType(field.dataType())) {
+        if (newFields != null) {
+          newFields[i] = field;
+        }
+        continue;
+      }
+      if (newFields == null) {
+        newFields = new StructField[fields.length];
+        System.arraycopy(fields, 0, newFields, 0, i);
+        ordinals = new ArrayList<>();
+      }
+      ordinals.add(i);
+      StructField metaField = new StructField(
+          HoodieSchema.Variant.VARIANT_METADATA_FIELD, DataTypes.BinaryType, false, Metadata.empty());
+      StructField valField = new StructField(
+          HoodieSchema.Variant.VARIANT_VALUE_FIELD, DataTypes.BinaryType, false, Metadata.empty());
+      StructType variantStruct = new StructType(new StructField[] {metaField, valField});
+      Metadata enriched = new MetadataBuilder()
+          .withMetadata(field.metadata())
+          .putString(HoodieSchema.TYPE_METADATA_FIELD, HoodieSchemaType.VARIANT.name())
+          .build();
+      newFields[i] = new StructField(field.name(), variantStruct, field.nullable(), enriched);
+    }
+    if (newFields == null) {
+      return Pair.of(sparkSchema, EMPTY_INT_ARRAY);
+    }
+    int[] ordinalArr = new int[ordinals.size()];
+    for (int i = 0; i < ordinalArr.length; i++) {
Review Comment:
   🤖 nit: could you simplify this to `ordinals.stream().mapToInt(Integer::intValue).toArray()`? The three-line loop is a bit ceremonial for what's really just a List→int[] conversion.
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag quality.</i></sub>
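For context, the two forms the comment compares are equivalent. A minimal standalone sketch (class and method names here are illustrative, not Hudi code):

```java
import java.util.Arrays;
import java.util.List;

public class OrdinalConversion {
    // Hand-rolled loop, as in the current patch.
    static int[] toIntArrayLoop(List<Integer> ordinals) {
        int[] out = new int[ordinals.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = ordinals.get(i);
        }
        return out;
    }

    // Suggested one-liner: unbox each element via mapToInt, then collect.
    static int[] toIntArrayStream(List<Integer> ordinals) {
        return ordinals.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        List<Integer> ordinals = List.of(1, 4, 7);
        System.out.println(Arrays.equals(
            toIntArrayLoop(ordinals), toIntArrayStream(ordinals))); // true
        System.out.println(Arrays.toString(toIntArrayStream(ordinals))); // [1, 4, 7]
    }
}
```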
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -437,6 +438,29 @@ trait SparkAdapter extends Serializable {
writeMetadata: Consumer[Array[Byte]]
): BiConsumer[SpecializedGetters, Integer]
+  /**
+   * Creates a [[VariantProjectedRow]] for the Lance writer's variant projection.
+   * Spark 3.x throws (no VariantType support); Spark 4.x returns a row that delegates
+   * accessors to the wrapped input row, except at the configured variant ordinals where
+   * it returns the pre-allocated {@code (metadata, value)} struct populated by the
+   * matching extractor.
+   *
+   * Kept on the adapter so the shared {@code hudi-spark-client} module never needs to
+   * reference Spark-4-only types like {@code VariantVal} / {@code InternalRow.getVariant}.
+   *
+   * @param numFields              total number of top-level fields in the projection
+   * @param variantStructByOrdinal non-null only at variant ordinals; pre-allocated
+   *                               {@code GenericInternalRow(new Object[2])}
+   * @param extractorByOrdinal     non-null only at variant ordinals; populates the
+   *                               corresponding entry in {@code variantStructByOrdinal}
+   * @return a stateful [[VariantProjectedRow]] (Spark 4 only)
+   */
+  def createVariantProjectedRow(
+      numFields: Int,
+      variantStructByOrdinal: Array[GenericInternalRow],
+      extractorByOrdinal: java.util.List[BiConsumer[SpecializedGetters, java.lang.Integer]]
Review Comment:
   🤖 nit: `variantStructByOrdinal` is an `Array` (random-access by ordinal) but `extractorByOrdinal` is a `java.util.List` - both are indexed 0..numFields-1 and non-null only at variant positions. Could you make them the same type? Using `Array` for both would let callers use consistent `[]` access, and would also eliminate the null-fill loop in `SparkArrowWriter.of` (an array of reference type is already null-initialized).
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag quality.</i></sub>
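The null-initialization claim in the comment is a language guarantee (JLS §4.12.5: array elements of reference type default to `null`), so a caller switching from a pre-filled `List` to an `Array` needs no fill loop. A tiny sketch, with `Runnable` standing in for the extractor's `BiConsumer` type and an invented field count:

```java
public class NullInitDemo {
    public static void main(String[] args) {
        int numFields = 4; // hypothetical projection width
        // Reference-type array slots are null by default; no explicit fill needed.
        Runnable[] extractorByOrdinal = new Runnable[numFields];
        for (Runnable r : extractorByOrdinal) {
            if (r != null) {
                throw new AssertionError("expected null-initialized slots");
            }
        }
        // Only the variant ordinals get populated; every other slot stays null.
        extractorByOrdinal[2] = () -> { };
        System.out.println(extractorByOrdinal[0] == null); // true
        System.out.println(extractorByOrdinal[2] != null); // true
    }
}
```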
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]