voonhous opened a new issue, #18069:
URL: https://github.com/apache/hudi/issues/18069
### Task Description
**What needs to be done:**
Implement support for **UPDATE** and **MERGE** operations on tables with
shredded variant columns. Currently, these operations fail with a
`NullPointerException` in the Spark Avro deserializer when attempting to read
and merge existing variant data.
### Current Status
- Variant shredding works for INSERT operations **(DONE)**
- UPDATE operations fail with NPE during merge **(NOT DONE)**
- MERGE operations fail with NPE during merge **(NOT DONE)**
### Error Details
```
java.lang.NullPointerException
  at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:240)
  at org.apache.spark.sql.avro.HoodieSpark4_0AvroDeserializer.deserialize(HoodieSpark4_0AvroDeserializer.scala:30)
  at org.apache.hudi.common.model.HoodieAvroRecordMerger.merge(HoodieAvroRecordMerger.java:67)
  at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:270)
```
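For reference, the failure surfaces with a plain SQL `UPDATE` (or `MERGE INTO`) against a table that already holds shredded variant data. The sketch below only illustrates the shape of such a reproduction: the table and column names are made up, `spark` is assumed to be an existing Spark 4.x session with the Hudi Spark bundle on the classpath, and the shredding config key is a placeholder rather than the actual property name.

```scala
// Minimal reproduction sketch (hypothetical table/column names and config key),
// assuming `spark` is a Spark 4.x session with the Hudi Spark bundle available.
spark.sql("SET hoodie.parquet.variant.shredding.enabled=true")  // placeholder config key

spark.sql(
  """CREATE TABLE variant_tbl (id INT, payload VARIANT)
    |USING hudi
    |TBLPROPERTIES (primaryKey = 'id')""".stripMargin)

// INSERT works today: the variant column is written in shredded form.
spark.sql("""INSERT INTO variant_tbl SELECT 1, parse_json('{"a": 1, "b": "x"}')""")

// UPDATE (and MERGE INTO) go through the merge path, where the Avro deserializer
// currently throws the NullPointerException shown in the stack trace above.
spark.sql("""UPDATE variant_tbl SET payload = parse_json('{"a": 2}') WHERE id = 1""")
```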
### Required Changes
1. **Fix Schema Resolution in Merge Path**
   - Ensure the reader schema matches the writer schema for variant fields
   - Detect whether existing data has shredded variants (see the schema-probe sketch after this list)
   - Pass the correct variant shredding config to readers
   - Files: `BaseSparkCommitActionExecutor.java`, `FileGroupReaderBasedMergeHandle.java`
2. **Add Null Handling for Variant Fields**
   - Add null checks for variant components (`value`, `metadata`, `typed_value`)
   - Handle cases where `typed_value` is null (type mismatch)
   - Handle cases where `value` is null (fully shredded)
   - File: `HoodieSpark4_0AvroDeserializer.scala` (see the null-handling sketch after this list)
3. **Support Schema Evolution**
   - Allow reading shredded data when shredding is disabled
   - Allow reading unshredded data when shredding is enabled
   - Add schema compatibility checks (see the schema-probe sketch after this list)
   - File: `HoodieAvroWriteSupport.java`
4. **Update Tests**
   - Re-enable the disabled tests (the reproduction sketch above and the MERGE INTO sketch after this list show the intended shape):
     - `Test Variant Shredding with Update Operation` (TestVariantDataType.scala:220)
     - `Test Variant Shredding with Merge Operation` (TestVariantDataType.scala:280)
   - Add schema evolution test cases
   - Test mixed scenarios (shredded + unshredded records)
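For item 2, the intended defensive handling might look like the sketch below. It assumes the shredded variant reaches the deserializer as an Avro record with `metadata`, `value`, and `typed_value` fields (the Parquet variant shredding layout); the object and method names are hypothetical and not part of `HoodieSpark4_0AvroDeserializer.scala` today.

```scala
import java.nio.ByteBuffer

import org.apache.avro.generic.GenericRecord

// Hypothetical helper: classify a shredded variant group before rebuilding the Spark
// variant value, instead of dereferencing sub-fields that may legitimately be null.
object VariantGroupGuards {

  sealed trait VariantShape
  case object NullVariant extends VariantShape   // nothing to rebuild
  final case class BinaryOnly(metadata: ByteBuffer, value: ByteBuffer) extends VariantShape
  final case class FullyShredded(metadata: ByteBuffer, typedValue: AnyRef) extends VariantShape

  def classify(group: GenericRecord): VariantShape = {
    val metadata   = group.get("metadata").asInstanceOf[ByteBuffer]
    val value      = group.get("value").asInstanceOf[ByteBuffer]
    val typedValue = group.get("typed_value")

    if (metadata == null || (value == null && typedValue == null)) {
      // No metadata, or neither representation present: surface a SQL NULL rather
      // than letting a later dereference throw a NullPointerException.
      NullVariant
    } else if (value != null) {
      // `value` is present (e.g. the row did not match the shredding schema):
      // the binary encoding can be used directly.
      BinaryOnly(metadata, value)
    } else {
      // `value` is null but `typed_value` is set: the row was fully shredded and
      // the binary value has to be reassembled from `typed_value`.
      FullyShredded(metadata, typedValue)
    }
  }
}
```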
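For items 1 and 3, the schema-side checks could start from probes like these. They only assume that a variant field is carried as an (optionally nullable) Avro record with `metadata`/`value` branches, plus a `typed_value` branch when shredded; the object and method names are hypothetical.

```scala
import scala.jdk.CollectionConverters._

import org.apache.avro.Schema

// Hypothetical schema probes: tell shredded from unshredded variant layouts so the
// merge path can choose a reader schema that matches what was actually written.
object VariantSchemaCompat {

  // Variant fields are usually nullable, i.e. union(null, record); unwrap to the record.
  private def asRecord(schema: Schema): Option[Schema] = schema.getType match {
    case Schema.Type.RECORD => Some(schema)
    case Schema.Type.UNION  => schema.getTypes.asScala.find(_.getType == Schema.Type.RECORD)
    case _                  => None
  }

  /** Binary variant layout: a record carrying `metadata` and `value`. */
  def isVariantField(fieldSchema: Schema): Boolean =
    asRecord(fieldSchema).exists { r =>
      r.getField("metadata") != null && r.getField("value") != null
    }

  /** Shredded layout: the same record additionally carries a `typed_value` branch. */
  def isShreddedVariantField(fieldSchema: Schema): Boolean =
    isVariantField(fieldSchema) &&
      asRecord(fieldSchema).exists(_.getField("typed_value") != null)
}
```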
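For item 4, the re-enabled merge test could mirror the UPDATE reproduction above but drive the merge path through `MERGE INTO`; the table, view, and column names are again illustrative.

```scala
// Illustrative MERGE INTO exercise for the re-enabled merge test (hypothetical names),
// reusing the variant_tbl created in the reproduction sketch above.
spark.sql(
  """CREATE OR REPLACE TEMP VIEW variant_src AS
    |SELECT 1 AS id, parse_json('{"a": 3}') AS payload""".stripMargin)

spark.sql(
  """MERGE INTO variant_tbl t
    |USING variant_src s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET payload = s.payload
    |WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)""".stripMargin)

// Read back the merged variant as JSON to verify the round trip.
spark.sql("SELECT id, to_json(payload) FROM variant_tbl").show(truncate = false)
```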
**Why this task is needed:**
Variant shredding is a key optimization for columnar storage of
semi-structured data, enabling efficient compression and query performance on
typed fields within variant columns (similar to Spark's variant shredding
feature). However, the current implementation only supports INSERT operations,
making it unusable for real-world scenarios where **UPDATE** and **MERGE**
operations are required.
### Task Type
Code improvement/refactoring
### Related Issues
**Parent feature issue:** (if applicable)
**Related issues:**