suryaprasanna opened a new pull request, #17943: URL: https://github.com/apache/hudi/pull/17943
### Describe the issue this Pull Request addresses

Some datasets experience longer execution times during clean operations due to unnecessary clean checks. This PR optimizes the clean operation for MOR (Merge-on-Read) tables by skipping clean when it's not needed.

### Summary and Changelog

Introduced logic to skip clean operations when they are unnecessary in MOR tables, specifically when the last compaction timestamp is less than the last clean timestamp and there are no non-delta commits between them.

**Changes:**
- Added `canCleanBeSkipped()` method in `CleanPlanner.java` to determine if clean operation can be skipped
- The method checks if last compaction's completion time is before last clean's requested time and ensures no non-delta commits exist between them
- Applied the optimization for `KEEP_LATEST_FILE_VERSIONS` cleaning policy
- Added necessary imports for `HoodieTableType` and `InstantComparison`

### Impact

This change improves performance for datasets experiencing slow clean operations by avoiding unnecessary clean checks. The optimization only applies to MOR tables and does not affect COW (Copy-on-Write) tables.

### Risk Level

**Low** - This optimization only skips clean operations when they are provably unnecessary (when no new files have been created since the last clean). The logic ensures data correctness is maintained by checking that only delta commits occurred between the last compaction and last clean.

### Documentation Update

None - This is an internal performance optimization that doesn't introduce new features or configuration options.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
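
The skip condition described under "Summary and Changelog" can be sketched roughly as follows. This is not the code from the PR and does not use Hudi's timeline API: the `Action` enum, the `Instant` class, the numeric timestamps, and the choice to compare other instants by completion time are all hypothetical stand-ins chosen for illustration. The sketch also assumes that a missing compaction or a missing clean means the clean should not be skipped.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class CleanSkipSketch {

    /** Hypothetical action labels; these are not Hudi's real timeline constants. */
    enum Action { DELTA_COMMIT, COMPACTION, REPLACE_COMMIT, CLEAN }

    /** Hypothetical, simplified stand-in for a completed timeline instant. */
    static final class Instant {
        final Action action;
        final long requestedTime;
        final long completionTime;

        Instant(Action action, long requestedTime, long completionTime) {
            this.action = action;
            this.requestedTime = requestedTime;
            this.completionTime = completionTime;
        }
    }

    /** Returns the instant of the given action with the greatest completion time, if any. */
    static Optional<Instant> latest(List<Instant> completed, Action action) {
        return completed.stream()
                .filter(i -> i.action == action)
                .max(Comparator.comparingLong((Instant i) -> i.completionTime));
    }

    /**
     * Sketch of the described check: on a MOR table, clean may be skipped when the last
     * compaction completed before the last clean was requested and nothing other than
     * delta commits completed in between.
     */
    static boolean canCleanBeSkipped(boolean isMorTable, List<Instant> completed) {
        if (!isMorTable) {
            return false; // the optimization is described as MOR-only
        }
        Optional<Instant> lastCompaction = latest(completed, Action.COMPACTION);
        Optional<Instant> lastClean = latest(completed, Action.CLEAN);
        if (!lastCompaction.isPresent() || !lastClean.isPresent()) {
            return false; // assumption: be conservative when either instant is missing
        }
        long compactionDone = lastCompaction.get().completionTime;
        long cleanRequested = lastClean.get().requestedTime;
        if (compactionDone >= cleanRequested) {
            return false; // compaction finished after the clean was requested
        }
        // No non-delta commit may have completed between the two instants.
        return completed.stream().noneMatch(i ->
                i.action != Action.DELTA_COMMIT
                        && i.action != Action.CLEAN
                        && i.completionTime > compactionDone
                        && i.completionTime < cleanRequested);
    }

    public static void main(String[] args) {
        List<Instant> timeline = Arrays.asList(
                new Instant(Action.COMPACTION, 10, 20),
                new Instant(Action.DELTA_COMMIT, 30, 35),
                new Instant(Action.CLEAN, 40, 45),
                new Instant(Action.DELTA_COMMIT, 50, 55));
        System.out.println(canCleanBeSkipped(true, timeline));
    }
}
```

In this sketch a replace commit that completes between the last compaction and the last clean makes the method return `false`, which mirrors the "no non-delta commits between them" condition from the changelog.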
