suryaprasanna opened a new pull request, #17942:
URL: https://github.com/apache/hudi/pull/17942

   ### Describe the issue this Pull Request addresses
   
   Datasets with large `record_index` partitions can hit OOM errors during clean planning when listing files. The current implementation uses the distributed engine context, so the only mitigation is to increase executor memory, which unnecessarily affects all executors.
   
   ### Summary and Changelog
   
   This PR addresses these OOM errors by using the local engine context (driver-only) for clean planning on metadata tables and non-partitioned datasets. This allows scaling only driver memory instead of memory on every executor.
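
   For illustration, here is a minimal sketch of the intended selection logic. This is not the actual implementation: the method and class names, the non-partitioned flag, and the `HoodieLocalEngineContext(Configuration)` constructor are simplifications and may differ across Hudi versions.

   ```java
   import org.apache.hadoop.conf.Configuration;

   import org.apache.hudi.common.engine.HoodieEngineContext;
   import org.apache.hudi.common.engine.HoodieLocalEngineContext;
   import org.apache.hudi.metadata.HoodieTableMetadata;

   public class CleanPlannerContextSelectionSketch {

     // Chooses the engine context used for file listing during clean planning.
     // Only metadata tables and non-partitioned datasets are eligible for the
     // driver-only context, and only when the new config is enabled.
     static HoodieEngineContext selectContext(HoodieEngineContext distributedContext,
                                              Configuration hadoopConf,
                                              String basePath,
                                              boolean isNonPartitioned,
                                              boolean useLocalEngineForCleanPlanning) {
       boolean eligible = HoodieTableMetadata.isMetadataTable(basePath) || isNonPartitioned;
       if (useLocalEngineForCleanPlanning && eligible) {
         // Driver-only context: listing runs on the driver, so only driver
         // memory needs to grow to avoid OOM during clean planning.
         return new HoodieLocalEngineContext(hadoopConf);
       }
       // Otherwise keep the distributed (e.g. Spark) engine context.
       return distributedContext;
     }
   }
   ```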
   
   Changes:
   - Added a new config `hoodie.clean.planner.use.local.engine.on.metadata.and.non-partitioned.tables` (default: `true`); a declaration sketch follows this list
   - Modified `CleanPlanActionExecutor` to use `HoodieLocalEngineContext` for metadata tables and non-partitioned datasets
   - Added the corresponding getter method in `HoodieWriteConfig`
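
   For reference, a rough sketch of how such a property is typically declared with Hudi's `ConfigProperty` builder. The constant name and documentation text below are placeholders, not the actual code added in `HoodieCleanConfig.java`.

   ```java
   import org.apache.hudi.common.config.ConfigProperty;

   public class CleanConfigSketch {
     // Hypothetical constant name; the real field in HoodieCleanConfig.java may differ.
     public static final ConfigProperty<Boolean> CLEAN_PLANNER_USE_LOCAL_ENGINE =
         ConfigProperty
             .key("hoodie.clean.planner.use.local.engine.on.metadata.and.non-partitioned.tables")
             .defaultValue(true)
             .withDocumentation("When enabled, clean planning on metadata tables and "
                 + "non-partitioned datasets uses the local (driver-only) engine context, "
                 + "so only driver memory needs to be tuned to avoid OOM while listing files.");
   }
   ```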
   
   ### Impact
   
   A new config property changes the execution context used for clean planning on metadata tables and non-partitioned datasets. Users hitting OOM errors there can now tune only driver memory instead of memory on all executors; a usage sketch follows below.
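
   As a usage illustration, a minimal sketch of turning the optimization off through write properties. The base path and table name are placeholders, and the `HoodieWriteConfig` builder methods shown follow the common builder pattern and may vary by Hudi version.

   ```java
   import java.util.Properties;

   import org.apache.hudi.config.HoodieWriteConfig;

   public class CleanPlannerConfigUsageSketch {
     public static void main(String[] args) {
       Properties props = new Properties();
       // Fall back to the distributed engine context for clean planning;
       // the config defaults to true (driver-only planning for eligible tables).
       props.setProperty(
           "hoodie.clean.planner.use.local.engine.on.metadata.and.non-partitioned.tables",
           "false");

       HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
           .withPath("/tmp/hudi/example_table")  // placeholder base path
           .forTable("example_table")            // placeholder table name
           .withProps(props)
           .build();

       // writeConfig would then be passed to the write client as usual.
       System.out.println(writeConfig.getBasePath());
     }
   }
   ```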
   
   ### Risk Level
   
   Low - The change is isolated to clean planning and only affects metadata tables and non-partitioned datasets. The new config defaults to `true`, which enables the optimized behavior out of the box.
   
   ### Documentation Update
   
   A config description documenting the new property and its usage was added in `HoodieCleanConfig.java`.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

