codope commented on code in PR #10874:
URL: https://github.com/apache/hudi/pull/10874#discussion_r1530294507
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1410,35 +1388,19 @@ protected void cleanIfNecessary(BaseHoodieWriteClient writeClient, String instan
   /**
    * Validates the timeline for both main and metadata tables to ensure compaction on MDT can be scheduled.
    */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option<String> inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-    // we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table.
-    // Whenever you want to change this logic, please ensure all below scenarios are considered.
-    // a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-    // b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, the latest compaction instant time in MDT represents
-    //    any instants before that is already synced with metadata table.
-    // c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every
-    //    instant before c4 is synced with metadata table.
-    List<HoodieInstant> pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-        .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-    if (!pendingInstants.isEmpty()) {
-      checkNumDeltaCommits(metadataMetaClient, dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-      LOG.info(String.format(
-          "Cannot compact metadata table as there are %d inflight instants in data table before latest deltacommit in metadata table: %s. Inflight instants in data table: %s",
-          pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, Arrays.toString(pendingInstants.toArray())));
-      return false;
-    }
-
-    // Check if there are any pending compaction or log compaction instants in the timeline.
-    // If pending compact/logCompaction operations are found abort scheduling new compaction/logCompaction operations.
-    Option<HoodieInstant> pendingLogCompactionInstant =
-        metadataMetaClient.getActiveTimeline().filterPendingLogCompactionTimeline().firstInstant();
-    Option<HoodieInstant> pendingCompactionInstant =
-        metadataMetaClient.getActiveTimeline().filterPendingCompactionTimeline().firstInstant();
-    if (pendingLogCompactionInstant.isPresent() || pendingCompactionInstant.isPresent()) {
-      LOG.warn(String.format("Not scheduling compaction or logCompaction, since a pending compaction instant %s or logCompaction %s instant is present",
-          pendingCompactionInstant, pendingLogCompactionInstant));
-      return false;
+  protected boolean validateCompactionScheduling() {
+    // Under the log compaction scope, the sequence of the log-compaction and compaction needs to be ensured because metadata items such as RLI
+    // only has proc-time ordering semantics. For "ensured", it means the completion sequence of the log-compaction/compaction is the same as the start sequence.
+    if (metadataWriteConfig.isLogCompactionEnabled()) {
Review Comment:
@danny0405 The file slicing, even for RLI, is based on completion time. Can you please explain with an example?

I think we can simplify for log compaction as well. If I understand correctly, even if we allow scheduling a log compaction while a previous log compaction is pending, the file slices in the two plans should not intersect.
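The non-intersection invariant can be sketched with plain sets. This is a hypothetical model for illustration only: `LogCompactionPlanCheck`, `isDisjointFromPending`, and the file-group-ID representation are made up here and are not Hudi's actual plan generator API.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a new log compaction plan is safe to schedule as long
// as the file groups it covers do not intersect any file group already held
// by a pending log compaction plan.
public class LogCompactionPlanCheck {

  /**
   * Returns true when the candidate plan touches none of the file groups
   * claimed by pending plans, so the two plans cannot conflict.
   */
  public static boolean isDisjointFromPending(Set<String> candidateFileGroups,
                                              List<Set<String>> pendingPlanFileGroups) {
    for (Set<String> pending : pendingPlanFileGroups) {
      if (!Collections.disjoint(pending, candidateFileGroups)) {
        return false; // overlapping file slice: reject scheduling
      }
    }
    return true;
  }
}
```

Under this model, a second log compaction over disjoint file groups is always valid, which is the simplification being suggested.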
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1407,43 +1382,7 @@ protected void cleanIfNecessary(BaseHoodieWriteClient writeClient, String instan
     writeClient.lazyRollbackFailedIndexing();
   }

-  /**
-   * Validates the timeline for both main and metadata tables to ensure compaction on MDT can be scheduled.
-   */
-  protected boolean validateTimelineBeforeSchedulingCompaction(Option<String> inFlightInstantTimestamp, String latestDeltaCommitTimeInMetadataTable) {
-    // we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table.
-    // Whenever you want to change this logic, please ensure all below scenarios are considered.
-    // a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed
-    // b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, the latest compaction instant time in MDT represents
-    //    any instants before that is already synced with metadata table.
-    // c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every
-    //    instant before c4 is synced with metadata table.
-    List<HoodieInstant> pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-        .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants();
-
-    if (!pendingInstants.isEmpty()) {
-      checkNumDeltaCommits(metadataMetaClient, dataWriteConfig.getMetadataConfig().getMaxNumDeltacommitsWhenPending());
-      LOG.info(String.format(
-          "Cannot compact metadata table as there are %d inflight instants in data table before latest deltacommit in metadata table: %s. Inflight instants in data table: %s",
-          pendingInstants.size(), latestDeltaCommitTimeInMetadataTable, Arrays.toString(pendingInstants.toArray())));
-      return false;
-    }
-
-    // Check if there are any pending compaction or log compaction instants in the timeline.
-    // If pending compact/logCompaction operations are found abort scheduling new compaction/logCompaction operations.
-    Option<HoodieInstant> pendingLogCompactionInstant =
Review Comment:
+1 for tests.

>> The pending log compaction check in the HoodieLogCompactionPlanGenerator class excludes file groups that are currently under log compaction, which means other file groups or partitions in metadata can still produce another log compaction plan.

> But there is no plan on the file slice that has pending plans (compaction/log_compaction), so the plan should still be valid?

Yeah, that's what should happen: a pending file slice should not be touched by a later log compaction.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]