voonhous opened a new issue, #18069:
URL: https://github.com/apache/hudi/issues/18069
### Task Description
**What needs to be done:**
Implement support for **UPDATE** and **MERGE** operations on tables with
shredded variant columns. Currently, these operations fail with a
`NullPointerException` in the Spark Avro deserializer when attempting to read
and merge existing variant data.
### Current Status
- Variant shredding works for INSERT operations **(DONE)**
- UPDATE operations fail with NPE during merge **(NOT DONE)**
- MERGE operations fail with NPE during merge **(NOT DONE)**
### Error Details
```
java.lang.NullPointerException
  at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:240)
  at org.apache.spark.sql.avro.HoodieSpark4_0AvroDeserializer.deserialize(HoodieSpark4_0AvroDeserializer.scala:30)
  at org.apache.hudi.common.model.HoodieAvroRecordMerger.merge(HoodieAvroRecordMerger.java:67)
  at org.apache.hudi.io.FileGroupReaderBasedMergeHandle.doMerge(FileGroupReaderBasedMergeHandle.java:270)
```
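For reference, the failure surfaces with a plain SQL `UPDATE` (or `MERGE INTO`) against a table that already holds shredded variant data. The sketch below only illustrates the shape of such a reproduction: the table and column names are made up, `spark` is assumed to be an existing Spark 4.x session with the Hudi Spark bundle on the classpath, and the shredding config key is a placeholder rather than the actual property name.

```scala
// Minimal reproduction sketch (hypothetical table/column names and config key),
// assuming `spark` is a Spark 4.x session with the Hudi Spark bundle available.
spark.sql("SET hoodie.parquet.variant.shredding.enabled=true")  // placeholder config key

spark.sql(
  """CREATE TABLE variant_tbl (id INT, payload VARIANT)
    |USING hudi
    |TBLPROPERTIES (primaryKey = 'id')""".stripMargin)

// INSERT works today: the variant column is written in shredded form.
spark.sql("""INSERT INTO variant_tbl SELECT 1, parse_json('{"a": 1, "b": "x"}')""")

// UPDATE (and MERGE INTO) go through the merge path, where the Avro deserializer
// currently throws the NullPointerException shown in the stack trace above.
spark.sql("""UPDATE variant_tbl SET payload = parse_json('{"a": 2}') WHERE id = 1""")
```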
### Required Changes
1. **Fix Schema Resolution in Merge Path**
   - Ensure the reader schema matches the writer schema for variant fields
   - Detect whether existing data has shredded variants (see the schema-probe sketch after this list)
   - Pass the correct variant shredding config to readers
   - Files: `BaseSparkCommitActionExecutor.java`, `FileGroupReaderBasedMergeHandle.java`
2. **Add Null Handling for Variant Fields**
   - Add null checks for variant components (`value`, `metadata`, `typed_value`)
   - Handle cases where `typed_value` is null (type mismatch)
   - Handle cases where `value` is null (fully shredded)
   - File: `HoodieSpark4_0AvroDeserializer.scala` (see the null-handling sketch after this list)
3. **Support Schema Evolution**
   - Allow reading shredded data when shredding is disabled
   - Allow reading unshredded data when shredding is enabled
   - Add schema compatibility checks (see the schema-probe sketch after this list)
   - File: `HoodieAvroWriteSupport.java`
4. **Update Tests**
   - Re-enable the disabled tests (the reproduction sketch above and the MERGE INTO sketch after this list show the intended shape):
     - `Test Variant Shredding with Update Operation` (TestVariantDataType.scala:220)
     - `Test Variant Shredding with Merge Operation` (TestVariantDataType.scala:280)
   - Add schema evolution test cases
   - Test mixed scenarios (shredded + unshredded records)
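For item 2, the intended defensive handling might look like the sketch below. It assumes the shredded variant reaches the deserializer as an Avro record with `metadata`, `value`, and `typed_value` fields (the Parquet variant shredding layout); the object and method names are hypothetical and not part of `HoodieSpark4_0AvroDeserializer.scala` today.

```scala
import java.nio.ByteBuffer

import org.apache.avro.generic.GenericRecord

// Hypothetical helper: classify a shredded variant group before rebuilding the Spark
// variant value, instead of dereferencing sub-fields that may legitimately be null.
object VariantGroupGuards {

  sealed trait VariantShape
  case object NullVariant extends VariantShape   // nothing to rebuild
  final case class BinaryOnly(metadata: ByteBuffer, value: ByteBuffer) extends VariantShape
  final case class FullyShredded(metadata: ByteBuffer, typedValue: AnyRef) extends VariantShape

  def classify(group: GenericRecord): VariantShape = {
    val metadata   = group.get("metadata").asInstanceOf[ByteBuffer]
    val value      = group.get("value").asInstanceOf[ByteBuffer]
    val typedValue = group.get("typed_value")

    if (metadata == null || (value == null && typedValue == null)) {
      // No metadata, or neither representation present: surface a SQL NULL rather
      // than letting a later dereference throw a NullPointerException.
      NullVariant
    } else if (value != null) {
      // `value` is present (e.g. the row did not match the shredding schema):
      // the binary encoding can be used directly.
      BinaryOnly(metadata, value)
    } else {
      // `value` is null but `typed_value` is set: the row was fully shredded and
      // the binary value has to be reassembled from `typed_value`.
      FullyShredded(metadata, typedValue)
    }
  }
}
```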
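For items 1 and 3, the schema-side checks could start from probes like these. They only assume that a variant field is carried as an (optionally nullable) Avro record with `metadata`/`value` branches, plus a `typed_value` branch when shredded; the object and method names are hypothetical.

```scala
import scala.jdk.CollectionConverters._

import org.apache.avro.Schema

// Hypothetical schema probes: tell shredded from unshredded variant layouts so the
// merge path can choose a reader schema that matches what was actually written.
object VariantSchemaCompat {

  // Variant fields are usually nullable, i.e. union(null, record); unwrap to the record.
  private def asRecord(schema: Schema): Option[Schema] = schema.getType match {
    case Schema.Type.RECORD => Some(schema)
    case Schema.Type.UNION  => schema.getTypes.asScala.find(_.getType == Schema.Type.RECORD)
    case _                  => None
  }

  /** Binary variant layout: a record carrying `metadata` and `value`. */
  def isVariantField(fieldSchema: Schema): Boolean =
    asRecord(fieldSchema).exists { r =>
      r.getField("metadata") != null && r.getField("value") != null
    }

  /** Shredded layout: the same record additionally carries a `typed_value` branch. */
  def isShreddedVariantField(fieldSchema: Schema): Boolean =
    isVariantField(fieldSchema) &&
      asRecord(fieldSchema).exists(_.getField("typed_value") != null)
}
```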
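For item 4, the re-enabled merge test could mirror the UPDATE reproduction above but drive the merge path through `MERGE INTO`; the table, view, and column names are again illustrative.

```scala
// Illustrative MERGE INTO exercise for the re-enabled merge test (hypothetical names),
// reusing the variant_tbl created in the reproduction sketch above.
spark.sql(
  """CREATE OR REPLACE TEMP VIEW variant_src AS
    |SELECT 1 AS id, parse_json('{"a": 3}') AS payload""".stripMargin)

spark.sql(
  """MERGE INTO variant_tbl t
    |USING variant_src s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET payload = s.payload
    |WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)""".stripMargin)

// Read back the merged variant as JSON to verify the round trip.
spark.sql("SELECT id, to_json(payload) FROM variant_tbl").show(truncate = false)
```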
**Why this task is needed:**
Variant shredding is a key optimization for columnar storage of
semi-structured data, enabling efficient compression and query performance on
typed fields within variant columns (similar to Spark's variant shredding
feature). However, the current implementation only supports INSERT operations,
making it unusable for real-world scenarios where **UPDATE** and **MERGE**
operations are required.
### Task Type
Code improvement/refactoring
### Related Issues
**Parent feature issue:** (if applicable)
**Related issues:**