[PR] fix(spark): Ignore duplicate fields when merging schema in IncrementalRelation [hudi]

via GitHub Sat, 03 Jan 2026 23:10:49 -0800


prashantwason opened a new pull request, #17776:
URL: https://github.com/apache/hudi/pull/17776


   ### Describe the issue this Pull Request addresses
   
   When HUDIIncrSource is used in HUDIStreamer, data is read from an existing 
Hudi table through IncrementalRelation. The skeleton schema (containing Hudi meta 
fields such as `_hoodie_record_key` and `_hoodie_partition_path`) is merged 
with the data schema to produce the schema used to read the Parquet files.
   
   If the data schema already contains any of these meta fields, the merged 
schema contains duplicate fields and Spark fails with a duplicate field error. 
This commonly occurs when:
   - Derived datasets are created with HUDIIncrSource and an auto-derived schema 
(RowBasedSchemaProvider) is used
   - The source table's data already contains the `_hoodie_partition_path` 
field, which is not removed from the data that is read
   
   Issue: [HUDI-4966](https://issues.apache.org/jira/browse/HUDI-4966)
   
   ### Summary and Changelog
   
   **Summary:** Filter out duplicate fields from skeleton schema when merging 
with data schema in IncrementalRelation to prevent Spark duplicate field errors.
   
   **Changelog:**
   - Modified `IncrementalRelationV1.scala` to filter skeleton schema fields 
that already exist in data schema before merging
   - Modified `IncrementalRelationV2.scala` with the same fix
   - No code was copied from external sources
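   
   The filtering step described in the changelog can be illustrated with a 
minimal sketch. This is not the code from this pull request: the object 
`SchemaMergeSketch` and the helper `mergeSkippingDuplicates` are hypothetical 
names, and the field order (skeleton fields first) and the exact, 
case-sensitive name comparison are assumptions made for illustration only.
   
   ```scala
   import org.apache.spark.sql.types.{StringType, StructField, StructType}

   object SchemaMergeSketch {
     // Hypothetical helper (not the PR's code): prepend skeleton (meta) fields
     // to the data schema, skipping any skeleton field whose name the data
     // schema already defines.
     def mergeSkippingDuplicates(skeleton: StructType, data: StructType): StructType = {
       // Assumption: names are compared exactly (case-sensitive).
       val dataFieldNames = data.fieldNames.toSet
       val missingSkeletonFields =
         skeleton.fields.filterNot(f => dataFieldNames.contains(f.name))
       StructType(missingSkeletonFields ++ data.fields)
     }

     def main(args: Array[String]): Unit = {
       val skeleton = StructType(Seq(
         StructField("_hoodie_record_key", StringType),
         StructField("_hoodie_partition_path", StringType)))
       // The data schema already carries one of the meta fields.
       val data = StructType(Seq(
         StructField("_hoodie_partition_path", StringType),
         StructField("id", StringType)))

       val merged = mergeSkippingDuplicates(skeleton, data)
       // prints: _hoodie_record_key, _hoodie_partition_path, id
       println(merged.fieldNames.mkString(", "))
     }
   }
   ```
   
   In this sketch the data schema's own definition of a duplicated field is 
kept and the skeleton's copy is dropped, so each field name appears only once 
in the merged schema.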
   
   ### Impact
   
   No public API changes. This is a bug fix that prevents runtime errors when 
reading from Hudi tables where the data schema contains Hudi meta fields.
   
   ### Risk Level
   
   Low - The change only filters out duplicate fields during schema merging. 
The fix is defensive and only takes effect when duplicates would otherwise cause 
an error. The logic is simple and well understood.
   
   ### Documentation Update
   
   None - This is a bug fix with no new features, configs, or user-facing 
changes.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
