nsivabalan commented on code in PR #8684:
URL: https://github.com/apache/hudi/pull/8684#discussion_r1220170726
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1052,51 +1091,81 @@ protected HoodieData<HoodieRecord> prepRecords(Map<MetadataPartitionType,
   }
   /**
-   * Perform a compaction on the Metadata Table.
-   *
-   * Cases to be handled:
-   * 1. We cannot perform compaction if there are previous inflight operations on the dataset. This is because
-   * a compacted metadata base file at time Tx should represent all the actions on the dataset till time Tx.
-   *
-   * 2. In multi-writer scenario, a parallel operation with a greater instantTime may have completed creating a
-   * deltacommit.
+   * Optimize the metadata table by running compaction, clean and archive as required.
+   * <p>
+   * Don't perform optimization if there are inflight operations on the dataset. This is for two reasons:
+   * - The compaction will contain the correct data as all failed operations have been rolled back.
+   * - Clean/compaction etc. will have the highest timestamp on the MDT and we won't be adding new operations
+   *   with smaller timestamps to metadata table (makes for easier debugging)
+   * <p>
+   * This adds the limitations that long-running async operations (clustering, etc.) may cause delay in such MDT
+   * optimizations. We will relax this after MDT code has been hardened.
   */
-  protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String instantTime) {
-    // finish off any pending compactions if any from previous attempt.
-    writeClient.runAnyPendingCompactions();
-
-    String latestDeltaCommitTimeInMetadataTable = metadataMetaClient.reloadActiveTimeline()
-        .getDeltaCommitTimeline()
-        .filterCompletedInstants()
-        .lastInstant().orElseThrow(() -> new HoodieMetadataException("No completed deltacommit in metadata table"))
-        .getTimestamp();
-    // we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table.
-    // Whenever you want to change this logic, please ensure all below scenarios are considered.
-    // a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-    // b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, latest compaction instant time in MDT represents
-    //    any instants before that is already synced with metadata table.
-    // c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every
-    //    instant before c4 is synced with metadata table.
-    List<HoodieInstant> pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-        .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
+  @Override
+  public void performTableServices(Option<String> inFlightInstantTimestamp) {
+    HoodieTimer metadataTableServicesTimer = HoodieTimer.start();
+    boolean allTableServicesExecutedSuccessfullyOrSkipped = true;
+    try {
+      BaseHoodieWriteClient writeClient = getWriteClient();
+      // Run any pending table services operations.
+      runPendingTableServicesOperations(writeClient);
+
+      // Check and run clean operations.
+      String latestDeltacommitTime = metadataMetaClient.reloadActiveTimeline().getDeltaCommitTimeline()
+          .filterCompletedInstants()
+          .lastInstant().get()
+          .getTimestamp();
+      LOG.info("Latest deltacommit time found is " + latestDeltacommitTime + ", running clean operations.");
+      cleanIfNecessary(writeClient, latestDeltacommitTime);
+
+      // Do timeline validation before scheduling compaction/logcompaction operations.
+      if (!validateTimelineBeforeSchedulingCompaction(inFlightInstantTimestamp, latestDeltacommitTime)) {
+        return;
Review Comment:
I see your point, but in practice I don't think we gain much: unless compaction kicks in on the MDT, archival will likely have nothing to do since the last time it was able to archive something. So it's not a bad idea to skip archival when there is no compaction. The same applies to clean as well, but let's not optimize anything in this patch.
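
For illustration, the idea discussed above (skip archival on the metadata table when no compaction completed, since the archivable boundary hasn't moved) could be sketched as follows. This is a hypothetical helper, not code from the PR; the class and method names are invented for the example:

```java
// Hypothetical sketch of the reviewer's suggestion: gate MDT archival
// on whether a compaction actually completed in this round.
public class MdtTableServiceGate {

  /**
   * Returns true when archival on the metadata table is worth attempting.
   * Without a new compaction, no additional instants have become eligible
   * for archival since the last run, so archival would be a no-op.
   */
  public static boolean shouldRunArchival(boolean compactionCompletedThisRound) {
    return compactionCompletedThisRound;
  }

  public static void main(String[] args) {
    // After a completed compaction, archival may have work to do.
    System.out.println(shouldRunArchival(true));
    // With no compaction, archival can be skipped entirely.
    System.out.println(shouldRunArchival(false));
  }
}
```

The same gate could in principle apply to clean, though as noted above that optimization is deliberately left out of this patch.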
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]