danny0405 commented on code in PR #18215:
URL: https://github.com/apache/hudi/pull/18215#discussion_r2844492059
##########
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java:
##########
@@ -355,42 +353,89 @@ public static HoodieCommitMetadata getCommitMetadata(
* for the archival in metadata table.
* <p>
* the qualified earliest instant is chosen as the earlier one between the
earliest
- * commit (COMMIT, DELTA_COMMIT, and REPLACE_COMMIT only, considering
non-savepoint
+ * commit that dataset can be potentially restored to
+ * (COMMIT, DELTA_COMMIT, and REPLACE_COMMIT only, considering non-savepoint
* commit only if enabling archive beyond savepoint) and the earliest
inflight
* instant (all actions).
*
* @param dataTableActiveTimeline the active timeline of the data table.
* @param shouldArchiveBeyondSavepoint whether to archive beyond savepoint.
+ * @param earliestUncleanedDataTableInstantTimeOption ECTR from latest clean
on data table, if present.
* @return the instant meeting the requirement.
*/
public static Option<HoodieInstant> getEarliestInstantForMetadataArchival(
- HoodieActiveTimeline dataTableActiveTimeline, boolean
shouldArchiveBeyondSavepoint) {
+ HoodieActiveTimeline dataTableActiveTimeline, boolean
shouldArchiveBeyondSavepoint,
+ Option<String> earliestUncleanedDataTableInstantTimeOption) {
// This is for commits only, not including CLEAN, ROLLBACK, etc.
- // When archive beyond savepoint is enabled, there are chances that there
could be holes
- // in the timeline due to archival and savepoint interplay. So, the first
non-savepoint
- // commit in the data timeline is considered as beginning of the active
timeline.
- Option<HoodieInstant> earliestCommit = shouldArchiveBeyondSavepoint
- ? dataTableActiveTimeline.getTimelineOfActions(
- CollectionUtils.createSet(
- COMMIT_ACTION, DELTA_COMMIT_ACTION, REPLACE_COMMIT_ACTION,
CLUSTERING_ACTION, SAVEPOINT_ACTION))
- .getFirstNonSavepointCommit()
- : dataTableActiveTimeline.getCommitsTimeline().firstInstant();
- // This is for all instants which are in-flight
+
+ // If it exists and is still in active timeline of data table, find
+ // the ECTR of latest clean in data table. Otherwise, default to the
earliest commit
+ // in timeline, as depending on the user's ingestion workload and clean
policy
+ // they may still be able and wish to restore to that instant for recovery
purposes
+ Option<HoodieInstant> earliestPossibleRestoreCommit =
getEarliestPossibleRestoreCommit(dataTableActiveTimeline,
+ earliestUncleanedDataTableInstantTimeOption);
+
+ // If there are any inflight instants, then get the earlier instant
(between inflight and the completed
+ // restore-able commit)
Option<HoodieInstant> earliestInflight =
dataTableActiveTimeline.filterInflightsAndRequested().firstInstant();
- if (earliestCommit.isPresent() && earliestInflight.isPresent()) {
- if (earliestCommit.get().compareTo(earliestInflight.get()) < 0) {
- return earliestCommit;
+ if (shouldArchiveBeyondSavepoint) {
+ return findSmallestInstant(earliestInflight,
earliestPossibleRestoreCommit);
Review Comment:
in the original logic, when `shouldArchiveBeyondSavepoint` is true,
`#getFirstNonSavepointCommit` is invoked to filter out the savepointed instant,
do we also need to ensure `earliestPossibleRestoreCommit` as non savepointed?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]