kbuci opened a new pull request, #18215:
URL: https://github.com/apache/hudi/pull/18215

   ### Summary and Changelog
   
   **Summary:** Metadata table (MDT) archival now respects the earliest commit 
to retain (ECTR) recorded by the data table’s latest clean. When an ECTR is 
present and still on the data table’s active timeline, the “earliest instant 
to retain” for MDT archival is derived from it instead of from the earliest 
data table commit, so more MDT instants can be archived when it is safe to do so.
   
   **Changelog:**
   
   - **TimelineUtils**
     - `getEarliestInstantForMetadataArchival` now takes a third parameter: 
`Option<String> earliestUncleanedDataTableInstantTimeOption` (ECTR from the 
latest clean).
     - The logic first computes the “earliest possible restore commit” (the 
ECTR if present and still on the timeline, else the first commit) and takes the 
smaller of that and the earliest inflight instant. When not archiving beyond 
savepoints, the result is the minimum of the first savepointed write, the 
earliest inflight instant, and the earliest possible restore commit.
     - Added `getEarliestRetainedCommitFromLastClean(HoodieTableMetaClient)` to 
read ECTR from the latest completed clean; throws `HoodieIOException` on read 
failure.
     - Added `findSmallestInstant(Option<HoodieInstant>, 
Option<HoodieInstant>)` and 
`getEarliestPossibleRestoreCommit(HoodieActiveTimeline, Option<String>)`.
   - **HoodieTimeline / BaseHoodieTimeline**
     - New method `findFirstSavepointedWrite()`: returns the first write commit 
that has been savepointed (used when blocking MDT archival on savepoints).
   - **TimelineArchiverV1 and TimelineArchiverV2**
     - When computing the earliest instant to retain for the metadata table, 
both archivers now pass the data table ECTR: 
`TimelineUtils.getEarliestRetainedCommitFromLastClean(dataMetaClient)` is 
supplied as the third argument to `getEarliestInstantForMetadataArchival`.
   - **Tests**
     - **TestTimelineUtils:** All `getEarliestInstantForMetadataArchival` call 
sites updated to the 3-arg signature; added cases for ECTR present, ECTR not on 
timeline, and savepoints + ECTR (earlier than savepoints and between 
savepoints).
     - **HoodieTestTable:** Added `addIncrementalClean(String instantTime, 
String earliestCommitToRetain)` for tests that need a clean with a given ECTR.
     - **TestHoodieTimelineArchiver:** Added 
`testArchivalInMetadataTableCanProceedUntilECTR` to assert MDT archival 
proceeds up to the data table ECTR.
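
The selection rule described under **TimelineUtils** can be sketched roughly as 
follows. This is a simplified illustration, not the code in this PR: instants 
are modeled as plain timestamp strings compared lexicographically (an 
assumption), and the class, method, and parameter names are hypothetical 
stand-ins for the `HoodieInstant` / `HoodieActiveTimeline` based helpers.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; instants are modeled as timestamp strings (assumption).
class MdtArchivalSketch {

  // ECTR if present and still on the active timeline, else the first commit.
  static Optional<String> earliestPossibleRestoreCommit(
      List<String> activeCommits, Optional<String> ectr) {
    if (ectr.isPresent() && activeCommits.contains(ectr.get())) {
      return ectr;
    }
    return activeCommits.stream().min(String::compareTo);
  }

  // Smaller of two optional instants; an empty side is ignored.
  static Optional<String> findSmallest(Optional<String> a, Optional<String> b) {
    if (!a.isPresent()) {
      return b;
    }
    if (!b.isPresent()) {
      return a;
    }
    return a.get().compareTo(b.get()) <= 0 ? a : b;
  }

  // Earliest data table instant that MDT archival must retain.
  static Optional<String> earliestInstantToRetain(
      List<String> activeCommits,
      Optional<String> earliestInflight,
      Optional<String> firstSavepointedWrite,
      boolean archiveBeyondSavepoint,
      Optional<String> ectr) {
    Optional<String> restoreCommit = earliestPossibleRestoreCommit(activeCommits, ectr);
    Optional<String> result = findSmallest(restoreCommit, earliestInflight);
    if (!archiveBeyondSavepoint) {
      // Savepoints block archival: also bound by the first savepointed write.
      result = findSmallest(result, firstSavepointedWrite);
    }
    return result;
  }
}
```

In this sketch, with commits `001` through `005`, an ECTR of `003`, and no 
inflight instants or savepoints, the result is `003` rather than `001`. If the 
ECTR is no longer on the timeline, the result falls back to `001`.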
   
   ### Impact
   
   - **User-facing / API:** No new public APIs. Behavior change only for tables 
with the metadata table enabled and at least one completed clean that has an 
ECTR; in that case, MDT archival may retain fewer instants (archive more), 
which is the intended correction.
   - **Performance:** The MDT timeline to scan is slightly shorter when an ECTR 
is present, because archival is more aggressive.
   
   ### Risk Level
   
   **Low.** Changes are limited to the MDT archival path and only apply when 
(1) the table has the metadata table enabled and (2) the data table has a 
completed clean with an ECTR. Logic is covered by unit tests 
(`TestTimelineUtils#testGetEarliestInstantForMetadataArchival`) and an 
integration test 
(`TestHoodieTimelineArchiver#testArchivalInMetadataTableCanProceedUntilECTR`). 
Failure to read clean metadata now throws `HoodieIOException` instead of 
returning empty, making I/O errors explicit.
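
The difference between “no ECTR available” and “reading the clean metadata 
failed” can be illustrated with a small sketch. It is not the PR’s 
implementation: `CleanMetadataReader` and the method name are hypothetical, 
`java.io.UncheckedIOException` stands in for `HoodieIOException`, and returning 
empty when no completed clean exists is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;

// Hypothetical sketch; UncheckedIOException stands in for HoodieIOException.
class CleanReadSketch {

  // Hypothetical reader for the latest completed clean's metadata.
  interface CleanMetadataReader {
    Optional<String> readEarliestCommitToRetain() throws IOException;
  }

  static Optional<String> earliestRetainedCommitFromLastClean(
      Optional<CleanMetadataReader> lastCompletedClean) {
    if (!lastCompletedClean.isPresent()) {
      // Assumption: no completed clean means there is no ECTR to report.
      return Optional.empty();
    }
    try {
      return lastCompletedClean.get().readEarliestCommitToRetain();
    } catch (IOException e) {
      // Surface the failure instead of silently behaving as if no ECTR exists.
      throw new UncheckedIOException("Failed to read latest clean metadata", e);
    }
  }
}
```

Propagating the exception keeps a transient read failure from being mistaken 
for a table that has never been cleaned.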
   
   ### Documentation Update
   
   None. This is an internal correction to MDT archival behavior when the data 
table has a completed clean with an ECTR; no new configs or user-facing 
features are added.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   

