kbuci opened a new pull request, #18306:
URL: https://github.com/apache/hudi/pull/18306

   ### Describe the issue this Pull Request addresses
   
   Currently, log compaction is scheduled whenever the operation type is 
`LOG_COMPACT`, regardless of how many delta commits have occurred since the 
last log compaction. This leads to unnecessary log compaction scheduling and 
wasted resources when only a few delta commits (and therefore likely only a 
few log files/blocks) have accumulated.
   
   ### Summary and Changelog
   
   Changes log compaction scheduling to use the `LogCompactionBlocksThreshold` 
config as a gating threshold. Instead of unconditionally scheduling log 
compaction, the scheduler now counts delta commits since the last compaction 
and the last log compaction, takes the minimum of the two, and only schedules 
log compaction when that count meets or exceeds the threshold.
   
   - Added `CompactionUtils.getDeltaCommitsSinceLatestLogCompaction()` which 
determines the number of delta commits since the most recent completed log 
compaction by inspecting the raw active timeline (needed because completed log 
compaction instants transition from `LOG_COMPACTION_ACTION` to 
`DELTA_COMMIT_ACTION`)
   - Added 
`ScheduleCompactionActionExecutor.getDeltaCommitInfoSinceLogCompaction()` which 
creates a raw active timeline and delegates to the new `CompactionUtils` method
   - Renamed `getLatestDeltaCommitInfo()` to 
`getLatestDeltaCommitInfoSinceCompaction()` for clarity
   - Updated `needCompact()` to replace the unconditional `return true` for 
`LOG_COMPACT` with threshold-based logic: 
`Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction) >= 
logCompactionBlocksThreshold`
   - Added unit tests for `getDeltaCommitsSinceLatestLogCompaction` covering 
completed log compaction, no log compaction, and empty timeline cases
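   The counting step described above can be sketched in plain Java. This is a simplified, self-contained illustration, not the actual Hudi implementation: the record type `Instant`, the action strings, and the method shape here are assumptions standing in for Hudi's `HoodieTimeline` types, and it assumes the raw active timeline still shows completed log compactions under their original action (which is why the raw timeline is needed, per the bullet above).

```java
import java.util.List;

public class TimelineCountSketch {
    // Hypothetical stand-in for a Hudi timeline instant (action + state).
    record Instant(String action, boolean completed) {}

    // Walk the raw timeline from newest to oldest, counting completed delta
    // commits until the most recent completed log compaction is found. If no
    // log compaction exists, every completed delta commit counts.
    static int deltaCommitsSinceLatestLogCompaction(List<Instant> rawTimeline) {
        int count = 0;
        for (int i = rawTimeline.size() - 1; i >= 0; i--) {
            Instant inst = rawTimeline.get(i);
            if (inst.action().equals("logcompaction") && inst.completed()) {
                return count;
            }
            if (inst.action().equals("deltacommit") && inst.completed()) {
                count++;
            }
        }
        return count;
    }
}
```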
   
   ### Impact
   
   No public API changes. Log compaction will now be scheduled less frequently 
— only when enough delta commits have accumulated since the last compaction or 
log compaction to meet the `hoodie.log.compaction.blocks.threshold` (default: 
5). This reduces unnecessary log compaction overhead for tables with frequent 
small writes.
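   The gating decision can be sketched as a standalone method (the method name and plain-int parameters are illustrative; in the PR this logic lives inside `needCompact()` for the `LOG_COMPACT` operation type):

```java
public class LogCompactGateSketch {
    // Only schedule log compaction once the smaller of the two counts reaches
    // the threshold, so a recent compaction of either kind defers scheduling
    // until enough new delta commits have landed.
    static boolean shouldScheduleLogCompaction(int deltaCommitsSinceCompaction,
                                               int deltaCommitsSinceLogCompaction,
                                               int logCompactionBlocksThreshold) {
        return Math.min(deltaCommitsSinceCompaction, deltaCommitsSinceLogCompaction)
                >= logCompactionBlocksThreshold;
    }

    public static void main(String[] args) {
        // With the default threshold of 5:
        System.out.println(shouldScheduleLogCompaction(7, 3, 5)); // false
        System.out.println(shouldScheduleLogCompaction(7, 6, 5)); // true
    }
}
```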
   
   ### Risk Level
   
   Low. The change only affects log compaction scheduling frequency. Regular 
compaction scheduling is unchanged.
   
   ### Documentation Update
   
   None. No new configs are introduced; the existing 
`hoodie.log.compaction.blocks.threshold` config is now also used to gate 
scheduling frequency in addition to its existing role in plan generation.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

