suryaprasanna opened a new pull request, #17943:
URL: https://github.com/apache/hudi/pull/17943

   ### Describe the issue this Pull Request addresses
   
   On some datasets, clean operations take longer than necessary because the 
cleaner runs checks even when there is nothing to clean. This PR optimizes 
clean for MOR (Merge-on-Read) tables by skipping it when it is not needed.
   
   ### Summary and Changelog
   
   Introduced logic to skip unnecessary clean operations on MOR tables: clean 
is skipped when the last compaction completed before the last clean was 
requested and there are no non-delta commits between the two.
   
   **Changes:**
   - Added a `canCleanBeSkipped()` method in `CleanPlanner.java` to determine 
whether a clean operation can be skipped
   - The method checks that the last compaction's completion time is before 
the last clean's requested time and that no non-delta commits exist between them
   - Applied the optimization to the `KEEP_LATEST_FILE_VERSIONS` cleaning policy
   - Added necessary imports for `HoodieTableType` and `InstantComparison`
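
   The skip condition described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `CleanSkipSketch`, `TableType`, `WriteAction`, and `Write` are hypothetical stand-ins rather than Hudi classes, timestamps are assumed to be fixed-width strings that sort chronologically, and "between them" is assumed to mean strictly after the compaction's completion time and strictly before the clean's requested time.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the check; none of these types are Hudi classes.
final class CleanSkipSketch {

    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

    enum WriteAction { COMMIT, DELTA_COMMIT }

    // A completed write on the timeline. Times are fixed-width strings,
    // so lexicographic order matches chronological order (assumption).
    static final class Write {
        final WriteAction action;
        final String time;

        Write(WriteAction action, String time) {
            this.action = action;
            this.time = time;
        }
    }

    // Returns true only when every condition from the PR description holds.
    static boolean canCleanBeSkipped(TableType tableType,
                                     String lastCompactionCompletionTime,
                                     String lastCleanRequestedTime,
                                     List<Write> completedWrites) {
        // The optimization applies to MOR tables only.
        if (tableType != TableType.MERGE_ON_READ) {
            return false;
        }
        // Without both a prior compaction and a prior clean there is nothing to compare.
        if (lastCompactionCompletionTime == null || lastCleanRequestedTime == null) {
            return false;
        }
        // The last compaction must have completed before the last clean was requested.
        if (lastCompactionCompletionTime.compareTo(lastCleanRequestedTime) >= 0) {
            return false;
        }
        // Any non-delta commit between the two means clean cannot be skipped.
        for (Write write : completedWrites) {
            boolean inWindow = write.time.compareTo(lastCompactionCompletionTime) > 0
                    && write.time.compareTo(lastCleanRequestedTime) < 0;
            if (inWindow && write.action != WriteAction.DELTA_COMMIT) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Write> onlyDeltas = Arrays.asList(
                new Write(WriteAction.DELTA_COMMIT, "012"),
                new Write(WriteAction.DELTA_COMMIT, "015"));
        System.out.println(canCleanBeSkipped(TableType.MERGE_ON_READ, "010", "020", onlyDeltas));
    }
}
```

   In this sketch every failed precondition falls back to running clean, so the shortcut can only ever remove work, never add a new way to delete files.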
   
   ### Impact
   
   This change improves performance for datasets experiencing slow clean 
operations by avoiding unnecessary clean checks. The optimization only applies 
to MOR tables and does not affect COW (Copy-on-Write) tables.
   
   ### Risk Level
   
   **Low** - This optimization only skips clean operations when they are 
provably unnecessary (when no new files have been created since the last 
clean). The check preserves data correctness by verifying that only delta 
commits occurred between the last compaction and the last clean.
   
   ### Documentation Update
   
   None - This is an internal performance optimization that doesn't introduce 
new features or configuration options.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

