yihua commented on code in PR #18036:
URL: https://github.com/apache/hudi/pull/18036#discussion_r3036105069
##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/HoodieSchemaConverter.java:
##########
@@ -554,8 +554,7 @@ private static DataType convertUnion(HoodieSchema schema) {
/**
* Converts a Variant schema to Flink's ROW type.
- * Variant is represented as ROW<`metadata` BYTES, `value` BYTES> in Flink.
- * // TODO: We are only supporting unshredded for now, support shredded in
the future
+ * Variant is represented as ROW<`value` BYTES, `metadata` BYTES> in Flink.
Review Comment:
<a href="#"><img alt="P2"
src="https://greptile-static-assets.s3.amazonaws.com/badges/p2.svg?v=7"
align="top"></a> **Reversed field order in updated Javadoc**
The updated comment says `ROW<value BYTES, metadata BYTES>` (value first),
but the actual implementation at lines 569ā572 creates the Flink ROW with
`metadata` as the first field and `value` as the second:
```java
return DataTypes.ROW(
DataTypes.FIELD("metadata", DataTypes.BYTES().notNull()),
DataTypes.FIELD("value", DataTypes.BYTES().notNull())
).notNull();
```
The comment should read `ROW<metadata BYTES, value BYTES>` to match the
code. The previous comment was actually correct about the field order (it only
had the wrong intent about which field came first positionally).
```suggestion
* Variant is represented as ROW<`metadata` BYTES, `value` BYTES> in Flink.
```
ā *Greptile*
([original](https://github.com/yihua/hudi/pull/12#discussion_r3036097637))
(source:comment#3036097637)
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -391,6 +542,9 @@ private ValueWriter makeWriter(HoodieSchema schema,
DataType dataType) {
}
});
};
+ } else if (dataType instanceof StructType
+ &&
SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType)
dataType)) {
+ return makeShreddedVariantWriter((StructType) dataType);
Review Comment:
_ā ļø Potential issue_ | _š“ Critical_
**Preserve `HoodieSchema` when building shredded child writers.**
Line 571 recreates every shredded child writer with `null` schema, so
schema-only semantics are lost on the write path. A shredded Variant that
contains a millis timestamp, for example, will still be written through the
default micros branch here, while `convert(...)` derives the Parquet logical
type from the real `HoodieSchema`. That gives you a schema/data mismatch for
affected shredded variants.
<details>
<summary>Suggested fix</summary>
```diff
- } else if (dataType instanceof StructType
- &&
SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType)
dataType)) {
- return makeShreddedVariantWriter((StructType) dataType);
+ } else if (dataType instanceof StructType
+ &&
SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType)
dataType)) {
+ return makeShreddedVariantWriter((StructType) dataType,
resolvedSchema);
} else if (dataType instanceof StructType) {
```
```diff
- private ValueWriter makeShreddedVariantWriter(StructType
shreddedStructType) {
- // Create writers for the shredded struct fields
- // The shreddedStructType contains: metadata (binary), value (binary),
typed_value (optional)
- ValueWriter[] shreddedFieldWriters =
Arrays.stream(shreddedStructType.fields())
- .map(field -> makeWriter(null, field.dataType()))
- .toArray(ValueWriter[]::new);
+ private ValueWriter makeShreddedVariantWriter(StructType
shreddedStructType, HoodieSchema variantSchema) {
+ ValueWriter[] shreddedFieldWriters =
getFieldWriters(shreddedStructType, variantSchema);
```
</details>
Also applies to: 567-585
<details>
<summary>š¤ Prompt for AI Agents</summary>
```
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java`
around lines 545 - 547, HoodieSchema is being dropped when creating shredded
child writers, causing schema-only semantics (e.g., millis vs micros
timestamps)
to be lost; update the calls that build shredded writers in
HoodieRowParquetWriteSupport (notably makeShreddedVariantWriter and the other
shredded-child factory calls in the same branch range) to pass the current
HoodieSchema through to the writer factory (or add an overload that accepts
HoodieSchema), and update those factory/signature implementations so
convert(...) and any Parquet logical-type resolution use the provided
HoodieSchema rather than null; apply the same schema propagation to all
shredded
writer creation sites in the 567-585 region so timestamps and other logical
types are derived from the real HoodieSchema.
```
</details>
<!--
fingerprinting:phantom:medusa:grasshopper:d884afb5-4216-4148-afc5-ef93118769b9
-->
<!-- This is an auto-generated comment by CodeRabbit -->
ā *CodeRabbit*
([original](https://github.com/yihua/hudi/pull/12#discussion_r3036104555))
(source:comment#3036104555)
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -2463,6 +2569,60 @@ public Option<HoodieSchema> getTypedValueField() {
return typedValueSchema;
}
+ /**
+ * Returns the typed_value schema with plain (unwrapped) types suitable
for Spark shredding utilities, i.e. essentially removing the `value` field
+ *
+ * <p>If the typed_value follows the variant shredding spec (each field is
a struct with
+ * {@code {value: bytes, typed_value: <type>}}), this extracts only the
inner typed_value types and returns a record schema containing just those plain
types.</p>
+ *
+ * <p>If the typed_value is already in plain form (created with {@code
createVariantShredded}),
+ * returns the schema as-is.</p>
+ *
+ * @return Option containing the plain typed_value schema, or
Option.empty() if not present
+ */
+ public Option<HoodieSchema> getPlainTypedValueSchema() {
+ if (!typedValueSchema.isPresent()) {
+ return Option.empty();
+ }
+ HoodieSchema tvSchema = typedValueSchema.get();
+ if (tvSchema.getType() != HoodieSchemaType.RECORD) {
+ return typedValueSchema;
+ }
+
+ List<HoodieSchemaField> fields = tvSchema.getFields();
+ // Check if all fields follow the nested shredding pattern: each field
is a record with {value, typed_value}
+ boolean isNestedForm = !fields.isEmpty() &&
fields.stream().allMatch(field -> {
+ HoodieSchema fieldSchema = field.schema();
+ if (fieldSchema.isNullable()) {
+ fieldSchema = fieldSchema.getNonNullType();
+ }
+ if (fieldSchema.getType() != HoodieSchemaType.RECORD) {
+ return false;
+ }
+ Option<HoodieSchemaField> valueSubField =
fieldSchema.getField(VARIANT_VALUE_FIELD);
+ Option<HoodieSchemaField> typedValueSubField =
fieldSchema.getField(VARIANT_TYPED_VALUE_FIELD);
+ return valueSubField.isPresent() && typedValueSubField.isPresent()
+ && fieldSchema.getFields().size() == 2;
+ });
Review Comment:
_ā ļø Potential issue_ | _š Major_
**Tighten the wrapper detection before flattening `typed_value`.**
`isNestedForm` only checks field names and field count. Any ordinary record
whose fields happen to be `value` and `typed_value` will be unwrapped here and
lose its `value` member. Check the actual shredded-wrapper contract instead of
names alone.
<details>
<summary>ā»ļø Suggested tightening</summary>
```diff
- boolean isNestedForm = !fields.isEmpty() &&
fields.stream().allMatch(field -> {
+ boolean isNestedForm = !fields.isEmpty() &&
fields.stream().allMatch(field -> {
HoodieSchema fieldSchema = field.schema();
if (fieldSchema.isNullable()) {
fieldSchema = fieldSchema.getNonNullType();
}
if (fieldSchema.getType() != HoodieSchemaType.RECORD) {
return false;
}
Option<HoodieSchemaField> valueSubField =
fieldSchema.getField(VARIANT_VALUE_FIELD);
Option<HoodieSchemaField> typedValueSubField =
fieldSchema.getField(VARIANT_TYPED_VALUE_FIELD);
- return valueSubField.isPresent() && typedValueSubField.isPresent()
- && fieldSchema.getFields().size() == 2;
+ return valueSubField.isPresent()
+ && valueSubField.get().schema().isNullable()
+ && valueSubField.get().schema().getNonNullType().getType() ==
HoodieSchemaType.BYTES
+ && typedValueSubField.isPresent()
+ && typedValueSubField.get().schema().isNullable()
+ && fieldSchema.getFields().size() == 2;
});
```
</details>
<!-- suggestion_start -->
<details>
<summary>š Committable suggestion</summary>
> ā¼ļø **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately
replaces the highlighted code, contains no missing lines, and has no issues
with indentation. Thoroughly test & benchmark the code to ensure it meets the
requirements.
```suggestion
List<HoodieSchemaField> fields = tvSchema.getFields();
// Check if all fields follow the nested shredding pattern: each field
is a record with {value, typed_value}
boolean isNestedForm = !fields.isEmpty() &&
fields.stream().allMatch(field -> {
HoodieSchema fieldSchema = field.schema();
if (fieldSchema.isNullable()) {
fieldSchema = fieldSchema.getNonNullType();
}
if (fieldSchema.getType() != HoodieSchemaType.RECORD) {
return false;
}
Option<HoodieSchemaField> valueSubField =
fieldSchema.getField(VARIANT_VALUE_FIELD);
Option<HoodieSchemaField> typedValueSubField =
fieldSchema.getField(VARIANT_TYPED_VALUE_FIELD);
return valueSubField.isPresent()
&& valueSubField.get().schema().isNullable()
&& valueSubField.get().schema().getNonNullType().getType() ==
HoodieSchemaType.BYTES
&& typedValueSubField.isPresent()
&& typedValueSubField.get().schema().isNullable()
&& fieldSchema.getFields().size() == 2;
});
```
</details>
<!-- suggestion_end -->
<!--
fingerprinting:phantom:medusa:grasshopper:17576896-9233-4f28-812a-0d5a3e064d5f
-->
<!-- This is an auto-generated comment by CodeRabbit -->
ā *CodeRabbit*
([original](https://github.com/yihua/hudi/pull/12#discussion_r3036104560))
(source:comment#3036104560)
##########
hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java:
##########
@@ -155,9 +190,71 @@ public void testWriteShreddedVariant() throws IOException {
assertEquals(BINARY, valueField.asPrimitiveType().getPrimitiveTypeName());
assertEquals(OPTIONAL, valueField.getRepetition(), "Shredded variant value
must be OPTIONAL");
- // Verify typed_value is omitted (as implementation skips it)
boolean hasTypedValue = vGroup.getFields().stream().anyMatch(f ->
f.getName().equals("typed_value"));
- assertFalse(hasTypedValue, "typed_value field should be omitted in this
writer implementation");
+ assertTrue(hasTypedValue, "typed_value field should be omitted in this
writer implementation");
Review Comment:
_ā ļø Potential issue_ | _š” Minor_
**Fix the inverted assertion message.**
This assertion expects `typed_value` to exist, but the failure message says
it should be omitted. If it trips, the output will be misleading.
<details>
<summary>Suggested fix</summary>
```diff
- assertTrue(hasTypedValue, "typed_value field should be omitted in this
writer implementation");
+ assertTrue(hasTypedValue, "typed_value field should be present in
shredded schemas");
```
</details>
<!-- suggestion_start -->
<details>
<summary>š Committable suggestion</summary>
> ā¼ļø **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately
replaces the highlighted code, contains no missing lines, and has no issues
with indentation. Thoroughly test & benchmark the code to ensure it meets the
requirements.
```suggestion
boolean hasTypedValue = vGroup.getFields().stream().anyMatch(f ->
f.getName().equals("typed_value"));
assertTrue(hasTypedValue, "typed_value field should be present in
shredded schemas");
```
</details>
<!-- suggestion_end -->
<details>
<summary>š¤ Prompt for AI Agents</summary>
```
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java`
around lines 193 - 194, The assertion is inverted: currently it uses
assertTrue(hasTypedValue, ...) but the message says "typed_value field
should be
omitted"; change the assertion to assertFalse(hasTypedValue, "typed_value
field
should be omitted in this writer implementation") so it correctly fails with
the
existing message (or alternatively keep assertTrue and update the message to
reflect presence). Update the check around vGroup and the local boolean
hasTypedValue accordingly.
```
</details>
<!--
fingerprinting:phantom:medusa:grasshopper:d884afb5-4216-4148-afc5-ef93118769b9
-->
<!-- This is an auto-generated comment by CodeRabbit -->
ā *CodeRabbit*
([original](https://github.com/yihua/hudi/pull/12#discussion_r3036104566))
(source:comment#3036104566)
##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/HoodieSchemaConverter.java:
##########
@@ -554,8 +554,7 @@ private static DataType convertUnion(HoodieSchema schema) {
/**
* Converts a Variant schema to Flink's ROW type.
- * Variant is represented as ROW<`metadata` BYTES, `value` BYTES> in Flink.
- * // TODO: We are only supporting unshredded for now, support shredded in
the future
+ * Variant is represented as ROW<`value` BYTES, `metadata` BYTES> in Flink.
Review Comment:
_ā ļø Potential issue_ | _š” Minor_
**Fix Variant field-order Javadoc to match the actual schema order.**
Line 557 documents `ROW<value BYTES, metadata BYTES>`, but `convertVariant`
builds `metadata` then `value` (Line 570 then Line 571). This also conflicts
with
`hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:680-710`
and tests in
`hudi-client/hudi-flink-client/src/test/java/org/apache/hudi/util/TestHoodieSchemaConverter.java:641-678`
and
`hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestVariantCrossEngineCompatibility.java:60-105`.
<details>
<summary>Suggested doc fix</summary>
```diff
- * Variant is represented as ROW<`value` BYTES, `metadata` BYTES> in
Flink.
+ * Variant is represented as ROW<metadata BYTES, `value` BYTES> in Flink.
```
</details>
<!-- suggestion_start -->
<details>
<summary>š Committable suggestion</summary>
> ā¼ļø **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately
replaces the highlighted code, contains no missing lines, and has no issues
with indentation. Thoroughly test & benchmark the code to ensure it meets the
requirements.
```suggestion
* Variant is represented as ROW<metadata BYTES, `value` BYTES> in Flink.
```
</details>
<!-- suggestion_end -->
<details>
<summary>š¤ Prompt for AI Agents</summary>
```
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/HoodieSchemaConverter.java`
at line 557, Update the Javadoc for the Variant field-order in
HoodieSchemaConverter to reflect the actual order used by convertVariant:
change
the documented schema from ROW<value BYTES, metadata BYTES> to ROW<metadata
BYTES, value BYTES> so it matches the implementation in convertVariant and
the
canonical order in HoodieSchema and related tests; verify the comment near
the
convertVariant method and any other Javadoc mentions in HoodieSchemaConverter
mention metadata first then value.
```
</details>
<!--
fingerprinting:phantom:poseidon:hawk:554bb875-cb9a-4d20-8fbc-a68f96f5bfda -->
<!-- This is an auto-generated comment by CodeRabbit -->
ā *CodeRabbit*
([original](https://github.com/yihua/hudi/pull/12#discussion_r3036104552))
(source:comment#3036104552)
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java:
##########
@@ -866,6 +866,112 @@ public static HoodieSchema.Vector createVector(String
name, int dimension, Vecto
return new HoodieSchema.Vector(vectorSchema);
}
+ /**
+ * Creates a shredded field struct per the Parquet variant shredding spec.
The returned struct contains two nullable fields:
+ * <ul>
+ * <li>{@code value}: nullable bytes (fallback binary representation)</li>
+ * <li>{@code typed_value}: nullable type (the typed representation)</li>
+ * </ul>
+ *
+ * <p>Example output structure:
+ * <pre>
+ * fieldName: struct
+ * |-- value: binary (nullable)
+ * |-- typed_value: <fieldType> (nullable)
+ * </pre></p>
+ *
+ * @param fieldName the name for the record (used as the Avro record name)
+ * @param fieldType the schema for the typed_value within this field
+ * @return a new HoodieSchema representing the shredded field struct
+ */
+ public static HoodieSchema createShreddedFieldStruct(String fieldName,
HoodieSchema fieldType) {
+ ValidationUtils.checkArgument(fieldName != null && !fieldName.isEmpty(),
"Field name cannot be null or empty");
+ ValidationUtils.checkArgument(fieldType != null, "Field type cannot be
null");
+ List<HoodieSchemaField> fields = Arrays.asList(
+ HoodieSchemaField.of(
+ Variant.VARIANT_VALUE_FIELD,
+ HoodieSchema.createNullable(HoodieSchemaType.BYTES),
+ "Fallback binary representation",
+ NULL_VALUE
+ ),
+ HoodieSchemaField.of(
+ Variant.VARIANT_TYPED_VALUE_FIELD,
+ HoodieSchema.createNullable(fieldType),
+ "Typed value representation",
+ NULL_VALUE
+ )
+ );
+ return HoodieSchema.createRecord(fieldName, null, null, fields);
+ }
Review Comment:
_ā ļø Potential issue_ | _š Major_
<details>
<summary>š§© Analysis chain</summary>
š Script executed:
```shell
find . -type f -name "HoodieSchema.java" | head -5
```
Repository: yihua/hudi
Length of output: 129
---
š Script executed:
```shell
cat -n
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java | sed
-n '887,905p'
```
Repository: yihua/hudi
Length of output: 1036
---
š Script executed:
```shell
cat -n
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java | sed
-n '953,972p'
```
Repository: yihua/hudi
Length of output: 1223
---
š Script executed:
```shell
rg "public static.*createRecord"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java -A 3
```
Repository: yihua/hudi
Length of output: 562
---
š Script executed:
```shell
rg "Variant\\.VARIANT_TYPED_VALUE_FIELD|Variant\\.VARIANT_VALUE_FIELD"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java |
head -20
```
Repository: yihua/hudi
Length of output: 442
---
š Script executed:
```shell
rg "createShreddedFieldStruct|createVariantShreddedObject"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java -B 2
-A 2
```
Repository: yihua/hudi
Length of output: 1578
---
š Script executed:
```shell
find . -type f -name "*Test*.java" | xargs grep -l
"createShreddedFieldStruct\|createVariantShreddedObject" 2>/dev/null | head -5
```
Repository: yihua/hudi
Length of output: 263
---
š Script executed:
```shell
rg "createShreddedFieldStruct|createVariantShreddedObject"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
-B 5 -A 10
```
Repository: yihua/hudi
Length of output: 4581
---
š Script executed:
```shell
rg "createShreddedFieldStruct|createVariantShreddedObject"
./hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java
-B 5 -A 10
```
Repository: yihua/hudi
Length of output: 7562
---
š Script executed:
```shell
# Check Avro schema naming rules - look for any comments or documentation
about named type collisions
rg "named.*type|duplicate.*record|global.*name|Avro.*schema"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java |
head -20
```
Repository: yihua/hudi
Length of output: 1519
---
š Script executed:
```shell
# Get the constant values for field names
rg "VARIANT_VALUE_FIELD|VARIANT_TYPED_VALUE_FIELD"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java |
head -10
```
Repository: yihua/hudi
Length of output: 634
---
š Script executed:
```shell
# Look for tests with multiple variant fields at same level
rg "\.put\("
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| grep -i shredded -A 2 -B 2
```
Repository: yihua/hudi
Length of output: 696
---
š Script executed:
```shell
# Check if there are tests with two variants in same record
grep -n "createVariantShreddedObject"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| wc -l
```
Repository: yihua/hudi
Length of output: 55
---
š Script executed:
```shell
# Look for the actual Avro schema generation - check the toString() output
to see if named types are duplicated
sed -n '1,100p'
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| head -50
```
Repository: yihua/hudi
Length of output: 2069
---
š Script executed:
```shell
# Check if there's any test creating a record with multiple variant fields
rg "createRecord.*HoodieSchemaField.*variant"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
-B 3 -A 3
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Check for Avro native schema behavior - look at what getAvroSchema()
returns for a shredded field struct
cat -n
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java | sed
-n '1,100p'
```
Repository: yihua/hudi
Length of output: 4547
---
š Script executed:
```shell
# Look at what the namespace argument actually does in the signature
cat -n
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java | sed
-n '700,750p'
```
Repository: yihua/hudi
Length of output: 2843
---
š Script executed:
```shell
# Check for tests with multiple variant fields in the same record structure
grep -n "createVariantShreddedObject"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| head -20
```
Repository: yihua/hudi
Length of output: 507
---
š Script executed:
```shell
# Look more carefully at how field names are used as record names
cat -n
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java | sed
-n '887,975p'
```
Repository: yihua/hudi
Length of output: 4684
---
š Script executed:
```shell
# Check if there's path-based naming already used elsewhere in the code
rg "namespace|path"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java |
grep -i "name\|path\|fully" | head -20
```
Repository: yihua/hudi
Length of output: 1665
---
š Script executed:
```shell
# Search for how schemas are serialized - what happens when you have
duplicate named types
rg "toString\|toJSON\|serialize"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java |
head -10
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Look at test that does round-trip to see if named type collisions would be
caught
grep -n "roundTrip\|parse\|toString"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| head -20
```
Repository: yihua/hudi
Length of output: 1284
---
š Script executed:
```shell
# Let me check if there's a test case that would catch the namespace bug at
line 970-971
sed -n '2950,2980p'
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
```
Repository: yihua/hudi
Length of output: 1461
---
š Script executed:
```shell
# Let's see what testCreateVariantShreddedObjectWithCustomName actually tests
sed -n '2945,2975p'
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
```
Repository: yihua/hudi
Length of output: 1437
---
š Script executed:
```shell
# Check if namespace is properly propagated in the test
rg "getNamespace\|namespace"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| grep -A 2 -B 2 "Shredded"
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Generate a test case to see what actually gets created - check the avro
schema output
cat > /tmp/test_collision.py << 'EOF'
# Simulate what happens with the namespace bug
params = {
"name": "typed_value",
"namespace": None,
"doc": "org.apache.hudi", # BUG: namespace passed as doc
"fields": []
}
print(f"Expected: name=typed_value, namespace='org.apache.hudi', doc=None")
print(f"Actual: name={params['name']}, namespace={params['namespace']},
doc={params['doc']}")
print()
print("Impact: The typed_value record will have NO namespace, even though
namespace was intended")
EOF
python /tmp/test_collision.py
```
Repository: yihua/hudi
Length of output: 272
---
š Script executed:
```shell
# Check what the actual Avro Schema.createRecord signature expects
rg "Schema.createRecord"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java -A 1
| head -20
```
Repository: yihua/hudi
Length of output: 1098
---
š Script executed:
```shell
# Let me verify the exact issue - check what the inner typed_value record
looks like in the generated schema
sed -n '2970,3010p'
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
```
Repository: yihua/hudi
Length of output: 2035
---
š Script executed:
```shell
# Look for any test that checks the namespace of the typed_value field
rg "typed_value.*namespace\|getField.*typed_value"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
-B 3 -A 3
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Let me check if there's actually a test verifying the namespace on nested
structures
sed -n '2999,3030p'
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
```
Repository: yihua/hudi
Length of output: 1833
---
š Script executed:
```shell
# Check if the namespace parameter affects the typed_value record by looking
at the actual Avro output
cat > /tmp/test_ns.py << 'EOF'
# In line 970-971:
# HoodieSchema.createRecord(Variant.VARIANT_TYPED_VALUE_FIELD, null,
namespace, typedValueFields)
# Maps to createRecord(String name, String namespace, String doc, ...)
# Which becomes: createRecord(name, doc, namespace, false, ...)
# So: name="typed_value", doc=null, namespace=namespace (the param)
# This LOOKS CORRECT when you trace the conversion!
# Wait - let me check the parameter order in the overload:
# First overload: createRecord(String name, String namespace, String doc,
List fields)
# This calls: createRecord(name, doc, namespace, false, fields)
# Note the REORDERING: (name, doc, namespace, false, fields)
# So in line 970-971:
# createRecord(VARIANT_TYPED_VALUE_FIELD, null, namespace, typedValueFields)
# Maps to the first overload:
# - name = VARIANT_TYPED_VALUE_FIELD
# - namespace = null <-- THIS IS THE SECOND PARAM
# - doc = namespace <-- THIS IS THE THIRD PARAM
# Then it calls: createRecord(VARIANT_TYPED_VALUE_FIELD, namespace, null,
false, typedValueFields)
# Which becomes: createRecord(name, doc, namespace, false, fields)
# Final: name=VARIANT_TYPED_VALUE_FIELD, doc=namespace, namespace=null
# CONFIRMED: namespace goes to doc slot, null goes to namespace slot!
print("BUG CONFIRMED: Line 970-971 passes arguments in wrong order")
print("The 'namespace' parameter ends up in the 'doc' slot")
print("The 'null' literal ends up in the 'namespace' slot")
EOF
python /tmp/test_ns.py
```
Repository: yihua/hudi
Length of output: 216
---
š Script executed:
```shell
# Look for a test with multiple variant fields in same record
grep -n "HoodieSchemaField.of.*variant\|createVariantShreddedObject"
./hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java
| head -30
```
Repository: yihua/hudi
Length of output: 1854
---
š Script executed:
```shell
# Check if there are any tests creating multiple variant fields at sibling
level
sed -n '1,50p'
./hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java
```
Repository: yihua/hudi
Length of output: 2362
---
š Script executed:
```shell
# Check if any test validates the namespace of the inner typed_value record
grep -n "getField.*typed_value\|getNamespace"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| grep -A 5 -B 5 "typed_value"
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Let's verify if the bug actually breaks anything - check what happens to
the typed_value record with namespace param
cat > /tmp/verify_bug.py << 'EOF'
# Simulate the createRecord call at line 970-971
# First overload: createRecord(String name, String namespace, String doc,
List fields)
# Implementation calls: createRecord(name, doc, namespace, false, fields)
# So when calling: createRecord("typed_value", null, "org.apache.hudi",
typedValueFields)
# Maps to: createRecord(name="typed_value", namespace=null,
doc="org.apache.hudi", fields=typedValueFields)
# Implementation does: createRecord("typed_value", "org.apache.hudi", null,
false, typedValueFields)
# Then calls Avro: Schema.createRecord(name, doc, namespace, isError)
# Which is: Schema.createRecord("typed_value", "org.apache.hudi", null,
false)
# Result: name="typed_value", doc="org.apache.hudi", namespace=null
# So the typed_value record LOSES its namespace and gets the namespace as
doc instead
# Avro fully qualified name: namespace.name
# With bug: null."typed_value" = just "typed_value"
# Without bug: "org.apache.hudi"."typed_value" =
"org.apache.hudi.typed_value"
# This MATTERS because:
# 1. Named types in Avro are resolved by fully qualified name
# 2. Without namespace, "typed_value" is in the default namespace
# 3. If you have multiple variants with typed_value records, they all map to
the same FQN
# 4. Avro will deduplicate/reuse the same named type definition
# 5. This could cause issues with schema merging or validation
print("BUG IMPACT: typed_value records lose their intended namespace")
print("This could cause named type collision/reuse across multiple variants")
EOF
python /tmp/verify_bug.py
```
Repository: yihua/hudi
Length of output: 184
---
š Script executed:
```shell
# Check if there's a test that creates multiple variant fields in a single
record
rg
"HoodieSchemaField.of.*createVariantShreddedObject|HoodieSchemaField.of.*variant.*HoodieSchemaField.of.*variant"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
-B 5
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Look for any collision detection in the schema or parser code
rg "duplicate\|collision\|uniqueness\|already.*defined"
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Check if any real code or tests create multiple variant fields in same
record
rg "variant.*variant"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| head -10
```
Repository: yihua/hudi
Length of output: 824
---
š Script executed:
```shell
# Let me look at the full scope/nesting context - are field structs at
different scopes or same?
cat > /tmp/analyze_scope.py << 'EOF'
# Structure from line 953-972:
# createVariantShreddedObject creates:
# {
# "metadata": bytes,
# "value": binary,
# "typed_value": {
# "a": { "value": binary, "typed_value": <INT> },
# "b": { "value": binary, "typed_value": <STRING> },
# "c": { "value": binary, "typed_value": <DECIMAL> }
# }
# }
# Named types created:
# 1. Variant record (via createVariantShredded): name="variant", namespace
from param
# 2. typed_value record (line 970): name="typed_value", namespace=NULL
(bug), should be from param
# 3. a struct (via createShreddedFieldStruct): name="a", namespace=null
# 4. b struct: name="b", namespace=null
# 5. c struct: name="c", namespace=null
# Avro named type fully qualified names:
# 1. "variant" (or "org.apache.hudi.variant" if namespace provided)
# 2. "typed_value" (SHOULD be "org.apache.hudi.typed_value" but namespace
bug makes it just "typed_value")
# 3. "a" (in default namespace)
# 4. "b" (in default namespace)
# 5. "c" (in default namespace)
# COLLISION SCENARIO:
# If you have TWO variant fields in same parent record:
# Field v1 -> variant -> typed_value -> { a, b, c }
# Field v2 -> variant -> typed_value -> { x, y, z }
#
# Both create a named record called "typed_value" with no namespace
# These could be seen as the same named type by Avro, causing collision
# The field-level structs (a, b, c, x, y, z) are at different scopes within
# their respective typed_value records, so they wouldn't collide
print("ACTUAL COLLISION RISK:")
print("1. Multiple variant fields in same record would create duplicate
'typed_value' records")
print("2. The inner field structs (a, b, c) are scoped so wouldn't collide")
print("3. The PRIMARY bug is at line 970: namespace parameter ignored for
typed_value")
print("4. SECONDARY concern: field structs also get no namespace (line 904)")
EOF
python /tmp/analyze_scope.py
```
Repository: yihua/hudi
Length of output: 379
---
š Script executed:
```shell
# Verify if Avro actually enforces global uniqueness for named types - check
what Avro version is used
rg "avro" hudi-common/pom.xml | head -5
```
Repository: yihua/hudi
Length of output: 350
---
š Script executed:
```shell
# Let me check what happens when you serialize/deserialize multiple variants
cat > /tmp/test_multi_variant.py << 'EOF'
# If you have a record like:
# {
# "v1": variant with shredded {a, b},
# "v2": variant with shredded {c, d}
# }
#
# Avro schema would have:
# v1 -> variant record -> typed_value record
# v2 -> variant record -> typed_value record
#
# Both typed_value records have the same name but no namespace
# In Avro's named type registry, they might collide
# Let me check if the tests actually serialize/deserialize multiple variants
print("Need to check if round-trip works with multiple variants in same
record")
EOF
python /tmp/test_multi_variant.py
```
Repository: yihua/hudi
Length of output: 125
---
š Script executed:
```shell
# Check if there are any tests with multiple variants in same schema
grep -n "variant.*variant"
./hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java
| head -5
```
Repository: yihua/hudi
Length of output: 546
---
š Script executed:
```shell
# Look at the actual namespace handling - does it preserve namespace in
field structs?
sed -n '960,972p'
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java
```
Repository: yihua/hudi
Length of output: 622
---
š Script executed:
```shell
# Let me trace through one more time to be absolutely certain about the bug
cat > /tmp/final_verification.py << 'EOF'
# Line 970-971 call:
# createRecord(Variant.VARIANT_TYPED_VALUE_FIELD, null, namespace,
typedValueFields)
# Method signature (line from earlier output):
# public static HoodieSchema createRecord(String name, String namespace,
String doc, List<HoodieSchemaField> fields)
# Parameter mapping:
call_args = ["Variant.VARIANT_TYPED_VALUE_FIELD", "null", "namespace",
"typedValueFields"]
method_params = ["name", "namespace", "doc", "fields"]
print("Call: createRecord(VARIANT_TYPED_VALUE_FIELD, null, namespace,
typedValueFields)")
print("Signature: createRecord(String name, String namespace, String doc,
List fields)")
print()
for i, (call, param) in enumerate(zip(call_args, method_params)):
print(f"Argument {i+1}: {call:40s} ā parameter: {param}")
print("\nFirst overload then calls: createRecord(name, doc, namespace,
false, fields)")
print("Which becomes: createRecord(VARIANT_TYPED_VALUE_FIELD, namespace,
null, false, typedValueFields)")
print()
# Now map to second overload
second_call_args = ["VARIANT_TYPED_VALUE_FIELD", "namespace", "null",
"false", "typedValueFields"]
second_method_params = ["name", "doc", "namespace", "isError", "fields"]
print("Second overload signature: createRecord(String name, String doc,
String namespace, boolean isError, List fields)")
print()
for i, (call, param) in enumerate(zip(second_call_args,
second_method_params)):
correct = "ā" if (i==0 and call=="VARIANT_TYPED_VALUE_FIELD" and
param=="name") or \
(i==1 and call=="namespace" and param=="doc") or \
(i==2 and call=="null" and param=="namespace") or \
(i==3 and call=="false" and param=="isError") or \
(i==4 and call=="typedValueFields" and
param=="fields") else "ā"
print(f"Argument {i+1}: {call:40s} ā parameter: {param:12s} {correct}")
print("\nā ļø BUG CONFIRMED: 'namespace' ends up in 'doc' slot, 'null' ends
up in 'namespace' slot")
EOF
python /tmp/final_verification.py
```
Repository: yihua/hudi
Length of output: 1290
---
š Script executed:
```shell
# Check if there's a test that validates nested variant namespace
preservation
grep -n "testCreateVariantShreddedObjectWithCustomName" -A 20
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| head -25
```
Repository: yihua/hudi
Length of output: 1194
---
š Script executed:
```shell
# Does the test check the typed_value field's namespace?
sed -n '2954,2974p'
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
```
Repository: yihua/hudi
Length of output: 1012
---
š Script executed:
```shell
# Check if the test validates the namespace of the nested typed_value record
grep -n "getTypedValueField\|namespace"
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
| grep -A 3 -B 3 "2954\|2970\|testCreateVariantShreddedObjectWithCustomName"
```
Repository: yihua/hudi
Length of output: 36
---
š Script executed:
```shell
# Let me verify: does testCreateVariantShreddedObjectWithCustomName check
typed_value namespace?
sed -n '2950,2962p'
./hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
```
Repository: yihua/hudi
Length of output: 709
---
</details>
**Fix namespace argument order for the `typed_value` record at line
970ā971.**
The `namespace` parameter is passed as the third argument (mapped to `doc`)
instead of the second (mapped to `namespace`):
```java
HoodieSchema typedValueSchema = HoodieSchema.createRecord(
Variant.VARIANT_TYPED_VALUE_FIELD, null, namespace, typedValueFields);
```
Should be:
```java
HoodieSchema typedValueSchema = HoodieSchema.createRecord(
Variant.VARIANT_TYPED_VALUE_FIELD, namespace, null, typedValueFields);
```
When `createVariantShreddedObject` is called with a `namespace` parameter,
the `typed_value` record loses it. Multiple variant fields in the same parent
record will both create unqualified `typed_value` records, causing schema
collision during serialization or round-trip parsing.
<details>
<summary>š¤ Prompt for AI Agents</summary>
```
Verify each finding against the current code and only fix it if needed.
In
`@hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java`
around lines 887 - 905, The typed_value record is being constructed with the
namespace and doc arguments swapped, causing the namespace to be lost; locate
the creation of typedValueSchema (in createVariantShreddedObject / the block
that calls HoodieSchema.createRecord with Variant.VARIANT_TYPED_VALUE_FIELD)
and
swap the second and third arguments so the namespace is passed as the second
parameter and doc as the third (i.e., call
HoodieSchema.createRecord(Variant.VARIANT_TYPED_VALUE_FIELD, namespace, null,
typedValueFields) instead of
HoodieSchema.createRecord(Variant.VARIANT_TYPED_VALUE_FIELD, null, namespace,
typedValueFields)) to preserve qualification and avoid schema collisions.
```
</details>
<!--
fingerprinting:phantom:medusa:grasshopper:17576896-9233-4f28-812a-0d5a3e064d5f
-->
<!-- This is an auto-generated comment by CodeRabbit -->
ā *CodeRabbit*
([original](https://github.com/yihua/hudi/pull/12#discussion_r3036104556))
(source:comment#3036104556)
##########
hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##########
@@ -253,10 +253,11 @@ private[sql] class AvroDeserializer(rootAvroType: Schema,
val decimal = createDecimal(bigDecimal, d.getPrecision, d.getScale)
updater.setDecimal(ordinal, decimal)
- case (RECORD, VariantType) if avroType.getProp("logicalType") ==
HoodieSchema.VARIANT_TYPE_NAME =>
+ case (RECORD, VariantType) if avroType.getLogicalType != null
+ && avroType.getLogicalType.getName == HoodieSchema.VARIANT_TYPE_NAME =>
Review Comment:
_ā ļø Potential issue_ | _š Major_
**Verify Avro logical type handling and consider adding a fallback property
check.**
The change to use `Schema.getLogicalType()` depends on the underlying Avro
version's behavior. According to Apache Avro documentation, `getLogicalType()`
returns a `LogicalType` instance only if:
1. A `LogicalTypeFactory` for that type name was registered before the
schema was parsed, OR
2. The specific Avro version automatically constructs it from the schema's
`logicalType` property
Using a compile-time constant reference (`HoodieSchema.VARIANT_TYPE_NAME`)
does not guarantee the `HoodieSchema` static initialization block runs before
deserialization, which means custom logical type factories may not be
registered when this code executes on schemas parsed externally (e.g., via
plain `Schema.Parser`).
Suggest confirming the Avro version's behavior with a regression test that
parses Variant schemas via standard Avro parsing before Hudi's custom
registration. If `getLogicalType()` can return null in that scenario, add the
`getProp("logicalType")` fallback:
```diff
case (RECORD, VariantType) if
+ (avroType.getLogicalType != null
+ && avroType.getLogicalType.getName ==
HoodieSchema.VARIANT_TYPE_NAME) ||
- avroType.getLogicalType != null
- && avroType.getLogicalType.getName ==
HoodieSchema.VARIANT_TYPE_NAME =>
+ HoodieSchema.VARIANT_TYPE_NAME == avroType.getProp("logicalType") =>
```
<!-- suggestion_start -->
<details>
<summary>š Committable suggestion</summary>
> ā¼ļø **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately
replaces the highlighted code, contains no missing lines, and has no issues
with indentation. Thoroughly test & benchmark the code to ensure it meets the
requirements.
```suggestion
case (RECORD, VariantType) if
(avroType.getLogicalType != null
&& avroType.getLogicalType.getName ==
HoodieSchema.VARIANT_TYPE_NAME) ||
HoodieSchema.VARIANT_TYPE_NAME == avroType.getProp("logicalType") =>
```
</details>
<!-- suggestion_end -->
<details>
<summary>š¤ Prompt for AI Agents</summary>
```
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala`
around lines 256 - 257, The pattern in AvroDeserializer.scala that checks
avroType.getLogicalType may miss schemas parsed before Hudi's logical-type
factory registration; update the case (RECORD, VariantType) branch to also
check
avroType.getProp("logicalType") as a fallback (e.g., compare that string to
HoodieSchema.VARIANT_TYPE_NAME) so Variant schemas parsed via plain
Schema.Parser are handled; ensure the change is applied in the matching code
path that currently uses avroType.getLogicalType and add/update a regression
test that parses a Variant logicalType via Schema.Parser before Hudi
registration to validate the fallback behavior.
```
</details>
<!--
fingerprinting:phantom:medusa:grasshopper:17576896-9233-4f28-812a-0d5a3e064d5f
-->
<!-- This is an auto-generated comment by CodeRabbit -->
ā *CodeRabbit*
([original](https://github.com/yihua/hudi/pull/12#discussion_r3036104564))
(source:comment#3036104564)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]