suryaprasanna opened a new pull request, #17942:
URL: https://github.com/apache/hudi/pull/17942
### Describe the issue this Pull Request addresses

Datasets with large `record_index` partitions cause OOM errors during clean planning when listing files. The current implementation uses the distributed engine context, so the only way to absorb the memory pressure is to increase executor memory, which unnecessarily affects every executor.

### Summary and Changelog

This PR addresses the OOM errors by using a local (driver-only) engine context for clean planning on metadata tables and non-partitioned datasets. This allows scaling only driver memory instead of the memory of every executor.

Changes:
- Added new config `hoodie.clean.planner.use.local.engine.on.metadata.and.non-partitioned.tables` (default: `true`)
- Modified `CleanPlanActionExecutor` to use `HoodieLocalEngineContext` for metadata tables and non-partitioned datasets (illustrative sketches follow the checklist below)
- Added a corresponding getter method in `HoodieWriteConfig`

### Impact

New config property that changes the execution context for clean planning on the affected table types. Users can now resolve these OOM errors by tuning only driver memory instead of executor memory across the whole cluster.

### Risk Level

Low. The change is isolated to clean planning and only affects metadata tables and non-partitioned datasets. The new config defaults to `true`, which enables the optimized behavior.

### Documentation Update

Config description added in `HoodieCleanConfig.java` documenting the new property and its usage.

### Contributor's checklist

- [x] Read through the [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
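
Below is a minimal sketch of what the described changes could look like, not the PR's actual code. The constant name `CLEAN_PLANNER_USE_LOCAL_ENGINE`, the getter `isCleanPlannerLocalEngineEnabled`, and the exact predicate are assumptions; the `HoodieLocalEngineContext` constructor and the `getStorageConf()` accessor also vary across Hudi versions:

```java
import org.apache.hudi.common.config.ConfigProperty;
import org.apache.hudi.common.engine.HoodieEngineContext;
import org.apache.hudi.common.engine.HoodieLocalEngineContext;

// In HoodieCleanConfig.java -- config definition using Hudi's ConfigProperty
// builder; the constant name is hypothetical.
public static final ConfigProperty<Boolean> CLEAN_PLANNER_USE_LOCAL_ENGINE = ConfigProperty
    .key("hoodie.clean.planner.use.local.engine.on.metadata.and.non-partitioned.tables")
    .defaultValue(true)
    .withDocumentation("When enabled, clean planning for metadata tables and "
        + "non-partitioned datasets runs with a local (driver-only) engine context, "
        + "so only driver memory needs tuning to avoid OOM during file listing.");

// In HoodieWriteConfig.java -- hypothetical getter name.
public boolean isCleanPlannerLocalEngineEnabled() {
  return getBoolean(HoodieCleanConfig.CLEAN_PLANNER_USE_LOCAL_ENGINE);
}

// In CleanPlanActionExecutor -- choose the context used for planning.
HoodieEngineContext planningContext =
    config.isCleanPlannerLocalEngineEnabled()
            && (table.isMetadataTable() || !table.isPartitioned())
        ? new HoodieLocalEngineContext(context.getStorageConf()) // driver-only listing
        : context;                                               // distributed listing
```

The design intuition, as I read the description: listing a huge `record_index` partition materializes the results in one place regardless, so doing it on the driver means a single driver-memory bump fixes the OOM rather than raising memory on every executor.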

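For users who want to control the behavior explicitly, a hedged example of setting the property through the write config; `basePath` is a placeholder for the table's base path, and there may also be a dedicated builder method in `HoodieCleanConfig` that this sketch does not assume:

```java
import java.util.Properties;
import org.apache.hudi.config.HoodieWriteConfig;

Properties props = new Properties();
// Default is "true"; set to "false" to keep using the distributed engine context.
props.setProperty(
    "hoodie.clean.planner.use.local.engine.on.metadata.and.non-partitioned.tables",
    "true");

HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .withPath(basePath) // basePath: the Hudi table's base path (placeholder)
    .withProps(props)
    .build();
```

With the property enabled, any remaining memory pressure during clean planning is addressed by raising driver memory (e.g. `spark.driver.memory` on Spark) rather than executor memory.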