hudi-agent commented on code in PR #18599:
URL: https://github.com/apache/hudi/pull/18599#discussion_r3210593217
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java:
##########
@@ -293,6 +305,54 @@ private static Field rewriteBlobDataChild(Field blobStructField) {
     return new Field(blobStructField.getName(), blobStructField.getFieldType(), rebuilt);
   }
+  /**
+   * Single-pass walk that returns (a) the enriched schema with top-level
+   * {@code VariantType} fields replaced by Hudi's canonical
+   * {@code Struct[metadata: binary, value: binary]} (tagged {@code hudi_type=VARIANT}
+   * so {@code HoodieSparkSchemaConverters} promotes it back on read), and (b) the
+   * variant ordinals in ascending order. {@code LanceArrowUtils} has no VariantType
+   * case, so we hand it a plain struct. Top-level only - nested variants are not
+   * yet supported.
+   */
+  private static Pair<StructType, int[]> enrichForLanceVariant(StructType sparkSchema) {
+    StructField[] fields = sparkSchema.fields();
+    StructField[] newFields = null;
+    List<Integer> ordinals = null;
+    for (int i = 0; i < fields.length; i++) {
+      StructField field = fields[i];
+      if (!SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantType(field.dataType())) {
+        if (newFields != null) {
+          newFields[i] = field;
+        }
+        continue;
+      }
+      if (newFields == null) {
+        newFields = new StructField[fields.length];
+        System.arraycopy(fields, 0, newFields, 0, i);
+        ordinals = new ArrayList<>();
+      }
+      ordinals.add(i);
+      StructField metaField = new StructField(
+          HoodieSchema.Variant.VARIANT_METADATA_FIELD, DataTypes.BinaryType, false, Metadata.empty());
+      StructField valField = new StructField(
+          HoodieSchema.Variant.VARIANT_VALUE_FIELD, DataTypes.BinaryType, false, Metadata.empty());
+      StructType variantStruct = new StructType(new StructField[] {metaField, valField});
+      Metadata enriched = new MetadataBuilder()
+          .withMetadata(field.metadata())
+          .putString(HoodieSchema.TYPE_METADATA_FIELD, HoodieSchemaType.VARIANT.name())
+          .build();
+      newFields[i] = new StructField(field.name(), variantStruct, field.nullable(), enriched);
+    }
+    if (newFields == null) {
+      return Pair.of(sparkSchema, EMPTY_INT_ARRAY);
+    }
+    int[] ordinalArr = new int[ordinals.size()];
+    for (int i = 0; i < ordinalArr.length; i++) {
Review Comment:
   🤖 nit: could you simplify this to `ordinals.stream().mapToInt(Integer::intValue).toArray()`? The three-line loop is a bit ceremonial for what's really just a List→int[] conversion.
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag quality.</i></sub>
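For context, the two forms the comment compares are equivalent. A minimal standalone sketch (class and method names here are illustrative, not Hudi code):

```java
import java.util.Arrays;
import java.util.List;

public class OrdinalConversion {
    // Hand-rolled loop, as in the current patch.
    static int[] toIntArrayLoop(List<Integer> ordinals) {
        int[] out = new int[ordinals.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = ordinals.get(i);
        }
        return out;
    }

    // Suggested one-liner: unbox each element via mapToInt, then collect.
    static int[] toIntArrayStream(List<Integer> ordinals) {
        return ordinals.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        List<Integer> ordinals = List.of(1, 4, 7);
        System.out.println(Arrays.equals(
            toIntArrayLoop(ordinals), toIntArrayStream(ordinals))); // true
        System.out.println(Arrays.toString(toIntArrayStream(ordinals))); // [1, 4, 7]
    }
}
```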
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -437,6 +438,29 @@ trait SparkAdapter extends Serializable {
writeMetadata: Consumer[Array[Byte]]
): BiConsumer[SpecializedGetters, Integer]
+  /**
+   * Creates a [[VariantProjectedRow]] for the Lance writer's variant projection.
+   * Spark 3.x throws (no VariantType support); Spark 4.x returns a row that delegates
+   * accessors to the wrapped input row, except at the configured variant ordinals where
+   * it returns the pre-allocated {@code (metadata, value)} struct populated by the
+   * matching extractor.
+   *
+   * Kept on the adapter so the shared {@code hudi-spark-client} module never needs to
+   * reference Spark-4-only types like {@code VariantVal} / {@code InternalRow.getVariant}.
+   *
+   * @param numFields              total number of top-level fields in the projection
+   * @param variantStructByOrdinal non-null only at variant ordinals; pre-allocated
+   *                               {@code GenericInternalRow(new Object[2])}
+   * @param extractorByOrdinal     non-null only at variant ordinals; populates the
+   *                               corresponding entry in {@code variantStructByOrdinal}
+   * @return a stateful [[VariantProjectedRow]] (Spark 4 only)
+   */
+  def createVariantProjectedRow(
+      numFields: Int,
+      variantStructByOrdinal: Array[GenericInternalRow],
+      extractorByOrdinal: java.util.List[BiConsumer[SpecializedGetters, java.lang.Integer]]
Review Comment:
   🤖 nit: `variantStructByOrdinal` is an `Array` (random-access by ordinal) but `extractorByOrdinal` is a `java.util.List` - both are indexed 0..numFields-1 and non-null only at variant positions. Could you make them the same type? Using `Array` for both would let callers use consistent `[]` access, and would also eliminate the null-fill loop in `SparkArrowWriter.of` (an array of reference type is already null-initialized).
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag quality.</i></sub>
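The null-initialization claim in the comment is a language guarantee (JLS §4.12.5: array elements of reference type default to `null`), so a caller switching from a pre-filled `List` to an `Array` needs no fill loop. A tiny sketch, with `Runnable` standing in for the extractor's `BiConsumer` type and an invented field count:

```java
public class NullInitDemo {
    public static void main(String[] args) {
        int numFields = 4; // hypothetical projection width
        // Reference-type array slots are null by default; no explicit fill needed.
        Runnable[] extractorByOrdinal = new Runnable[numFields];
        for (Runnable r : extractorByOrdinal) {
            if (r != null) {
                throw new AssertionError("expected null-initialized slots");
            }
        }
        // Only the variant ordinals get populated; every other slot stays null.
        extractorByOrdinal[2] = () -> { };
        System.out.println(extractorByOrdinal[0] == null); // true
        System.out.println(extractorByOrdinal[2] != null); // true
    }
}
```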
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]