[PR] [HUDI-18980] Propagate merge configs to file group reader during clustering [hudi]

via GitHub Sun, 14 Jun 2026 22:16:09 -0700


ad1happy2go opened a new pull request, #19007:
URL: https://github.com/apache/hudi/pull/19007


   ### Describe the issue this Pull Request addresses
   
   Clustering (inline and standalone) fails on a table configured with 
`hoodie.write.record.merge.mode=CUSTOM` and a custom merger via 
`hoodie.write.record.merge.custom.implementation.classes`, with:
   
   ```
   java.lang.IllegalArgumentException: No valid spark merger implementation set for `hoodie.write.record.merge.custom.implementation.classes`
       at org.apache.hudi.BaseSparkInternalRowReaderContext.getRecordMerger(BaseSparkInternalRowReaderContext.java:74)
       at org.apache.hudi.common.engine.HoodieReaderContext.initRecordMerger(HoodieReaderContext.java:332)
       at org.apache.hudi.common.table.read.HoodieFileGroupReader.<init>(HoodieFileGroupReader.java:111)
       at org.apache.hudi.table.action.cluster.strategy.ClusteringExecutionStrategy.getFileGroupReader(ClusteringExecutionStrategy.java)
       at org.apache.hudi.client.clustering.run.strategy.MultipleSparkJobExecutionStrategy...
   ```
   
   The same custom merger works for compaction and for MOR reads; only the 
clustering path fails.
   
   GitHub issue: https://github.com/apache/hudi/issues/18980
   
   ### Summary and Changelog
   
   - `ClusteringExecutionStrategy.getReaderProperties()` built a fresh 
`TypedProperties` containing only the spill-map / memory keys, dropping all 
merge-related configs. When `HoodieFileGroupReader` reads the source file 
groups, it calls `initRecordMerger`, which resolves the merge mode and strategy 
id from the table config (`CUSTOM`) but reads the merger impl classes from the 
reader properties. Because those properties did not carry the impl classes, 
the configured merger could not be instantiated and the read failed.
   - Fix: seed `getReaderProperties()` from 
`TypedProperties.copy(config.getProps())` before applying the spill/memory 
overrides, so the custom merger impl classes, merge mode and strategy id reach 
the reader. This mirrors the compaction read path 
(`FileGroupReaderBasedMergeHandle`), which already seeds its reader props from 
the full write config.
   - Adds `TestClusteringWithCustomMerger` (RDD-based clustering of a MOR table 
with a `CUSTOM` Spark merger), asserting clustering completes as a replace 
commit and all records remain readable.
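
The before/after behavior described in this list can be sketched in isolation. The example below is a minimal illustration, not the actual Hudi code: it uses plain `java.util.Properties` as a stand-in for `TypedProperties`, and the class `ReaderPropsSketch`, its methods, the `example.reader.max.memory` key, and the `com.example.MyCustomMerger` class name are all hypothetical. Only the two `hoodie.write.record.merge.*` keys come from this PR description.

```java
import java.util.Properties;

public class ReaderPropsSketch {
    // Key names taken from the PR description.
    static final String MERGE_MODE = "hoodie.write.record.merge.mode";
    static final String MERGER_IMPLS =
        "hoodie.write.record.merge.custom.implementation.classes";
    // Hypothetical stand-in for the spill-map / memory keys.
    static final String MAX_MEMORY = "example.reader.max.memory";

    // Old behavior: start from empty properties, so merge configs are dropped.
    static Properties readerPropsBefore(Properties writeConfig, long maxMemory) {
        Properties props = new Properties();
        props.setProperty(MAX_MEMORY, Long.toString(maxMemory));
        return props;
    }

    // Fixed behavior: copy the full write config first, then apply overrides.
    static Properties readerPropsAfter(Properties writeConfig, long maxMemory) {
        Properties props = new Properties();
        props.putAll(writeConfig); // copy, so the write config is not mutated
        props.setProperty(MAX_MEMORY, Long.toString(maxMemory));
        return props;
    }

    // Simplified stand-in for the merger lookup that failed during clustering.
    static String resolveMergerImpls(Properties readerProps) {
        String impls = readerProps.getProperty(MERGER_IMPLS);
        if (impls == null || impls.isEmpty()) {
            throw new IllegalArgumentException(
                "No valid merger implementation set for `" + MERGER_IMPLS + "`");
        }
        return impls;
    }

    // Hypothetical write config for a table using a custom merger.
    static Properties sampleWriteConfig() {
        Properties writeConfig = new Properties();
        writeConfig.setProperty(MERGE_MODE, "CUSTOM");
        writeConfig.setProperty(MERGER_IMPLS, "com.example.MyCustomMerger");
        return writeConfig;
    }

    public static void main(String[] args) {
        Properties writeConfig = sampleWriteConfig();
        try {
            resolveMergerImpls(readerPropsBefore(writeConfig, 1024L));
        } catch (IllegalArgumentException e) {
            System.out.println("before: " + e.getMessage());
        }
        System.out.println("after: "
            + resolveMergerImpls(readerPropsAfter(writeConfig, 1024L)));
    }
}
```

Copying the write config before applying the overrides keeps the spill/memory settings authoritative while letting every other key, including the merger impl classes, pass through to the reader.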
   
   ### Impact
   
   Clustering now works with `CUSTOM` record merge mode. No behavior change for 
the non-custom merge modes: the reader already had the merge mode / strategy 
id available from table config, and this change only additionally surfaces the 
impl-class property that was previously dropped. No public API or config changes.
   
   ### Risk Level
   
   low
   
   Verified end-to-end on Spark 3.5.6 against a MOR table with a custom 
`HoodieSparkRecordMerger`: before the change, inline clustering failed on every 
task with the error above; after the change, clustering completes (replace 
commit) with all records preserved.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


