prashantwason opened a new pull request, #17776: URL: https://github.com/apache/hudi/pull/17776
### Describe the issue this Pull Request addresses

When using HUDIIncrSource in HUDIStreamer, the data is read from an existing HUDI table using IncrementalRelation. The skeleton schema (containing Hudi meta fields like `_hoodie_record_key`, `_hoodie_partition_path`, etc.) is merged with the data schema to create the schema used to read from parquet files. If the data schema already contains any of these meta fields, Spark fails with duplicate field errors.

This commonly occurs when:

- Creating derived datasets using HUDIIncrSource where auto-derived schema (RowBasedSchemaProvider) is used
- The source table's data already contains the `_hoodie_partition_path` field which is not removed from the read data

Issue: [HUDI-4966](https://issues.apache.org/jira/browse/HUDI-4966)

### Summary and Changelog

**Summary:** Filter out duplicate fields from skeleton schema when merging with data schema in IncrementalRelation to prevent Spark duplicate field errors.

**Changelog:**

- Modified `IncrementalRelationV1.scala` to filter skeleton schema fields that already exist in data schema before merging
- Modified `IncrementalRelationV2.scala` with the same fix
- No code was copied from external sources

### Impact

No public API changes. This is a bug fix that prevents runtime errors when reading from Hudi tables where the data schema contains Hudi meta fields.

### Risk Level

Low - The change only filters out duplicate fields during schema merging. The fix is defensive and only activates when duplicates would otherwise cause an error. The logic is simple and well-understood.

### Documentation Update

None - This is a bug fix with no new features, configs, or user-facing changes.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
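For readers who want a concrete picture of the fix described in the changelog (dropping skeleton-schema fields that already exist in the data schema before the two are merged), here is a minimal sketch. It is not the code from this PR: `Field`, `Schema`, `SchemaMergeSketch`, and `mergeSchemas` are hypothetical stand-ins for Spark's schema types so the example runs without a Spark dependency, and it assumes exact, case-sensitive name matching, which may differ from what the real change does.

```scala
// Hypothetical, simplified stand-ins for Spark's StructField / StructType.
final case class Field(name: String, dataType: String)

final case class Schema(fields: Seq[Field]) {
  // Set of field names, used for fast membership checks.
  def fieldNames: Set[String] = fields.map(_.name).toSet
}

object SchemaMergeSketch {
  // Illustrative skeleton schema holding only the two meta fields
  // named in the PR description.
  val skeletonSchema: Schema = Schema(Seq(
    Field("_hoodie_record_key", "string"),
    Field("_hoodie_partition_path", "string")
  ))

  // Merge skeleton and data schemas, keeping a skeleton field only when
  // the data schema does not already define a field with the same name.
  // Assumption: names are compared exactly (case-sensitive).
  def mergeSchemas(skeleton: Schema, data: Schema): Schema = {
    val dataNames = data.fieldNames
    val uniqueSkeletonFields =
      skeleton.fields.filterNot(f => dataNames.contains(f.name))
    Schema(uniqueSkeletonFields ++ data.fields)
  }
}
```

With a data schema that already carries `_hoodie_partition_path`, calling `SchemaMergeSketch.mergeSchemas(SchemaMergeSketch.skeletonSchema, data)` yields a schema in which that field appears once, taken from the data schema, so the duplicate-field condition described above does not arise.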
