nbalajee commented on a change in pull request #3426:
URL: https://github.com/apache/hudi/pull/3426#discussion_r686393903



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -157,6 +147,37 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
         .lastInstant().map(HoodieInstant::getTimestamp);
   }
 
+  /**
+   * Perform a compaction on the Metadata Table.
+   *
+   * We cannot perform compaction if there are previous inflight operations on the dataset. This is because
+   * a compacted metadata base file at time Tx should represent all the actions on the dataset till time Tx.
+   */
+  private void compactIfNecessary(SparkRDDWriteClient writeClient, String instantTime) {
+    List<HoodieInstant> pendingInstants = datasetMetaClient.reloadActiveTimeline().filterInflightsAndRequested().findInstantsBefore(instantTime)
+        .getInstants().collect(Collectors.toList());
+    if (!pendingInstants.isEmpty()) {

Review comment:
       Consider the case where the dataset timeline has an incomplete commit (e.g. parallel commits C1 and C2 were started, C2 succeeded but C1 failed, leaving C1.inflight on the timeline). Dataset commits and delta commits on the metadata table will continue to succeed, but compaction could fall behind indefinitely until there is manual intervention.
   
   With the current approach, would manual intervention be required to clean up the inflight instant so that compaction can make progress? Or would ingestion/dataset commits eventually fail once the maxArchivalLimit on the metadata table is reached (because delta commits keep being created but are never compacted)?




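To make the concern above concrete, here is a minimal, self-contained sketch of the guard being discussed: compaction at a given instant time is skipped while any earlier instant on the dataset timeline is still inflight. The class and method names (`CompactionGuardSketch`, `pendingBefore`, the simplified `Instant` holder) are hypothetical stand-ins for Hudi's `HoodieInstant`/timeline APIs, used only to illustrate how a single leftover inflight commit (C1 in the scenario above) blocks every subsequent compaction attempt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified model of the check in compactIfNecessary():
// a compacted metadata base file at time Tx must cover all dataset actions
// up to Tx, so compaction is skipped while an earlier instant is pending.
public class CompactionGuardSketch {

  static class Instant {
    final String timestamp;
    final boolean completed;

    Instant(String timestamp, boolean completed) {
      this.timestamp = timestamp;
      this.completed = completed;
    }
  }

  /** Returns instants strictly before instantTime that are not yet completed. */
  static List<Instant> pendingBefore(List<Instant> timeline, String instantTime) {
    return timeline.stream()
        .filter(i -> !i.completed)
        .filter(i -> i.timestamp.compareTo(instantTime) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // C1 failed and was never cleaned up, leaving an inflight instant;
    // C2 completed normally.
    List<Instant> timeline = Arrays.asList(
        new Instant("001", false),   // C1: inflight
        new Instant("002", true));   // C2: completed

    List<Instant> pending = pendingBefore(timeline, "003");
    if (!pending.isEmpty()) {
      // This branch is taken on every compaction attempt until C1.inflight
      // is rolled back or cleaned up -- the "falls behind" scenario.
      System.out.println("skip compaction: " + pending.size() + " pending instant(s)");
    } else {
      System.out.println("compact");
    }
  }
}
```

Under this model, the guard never clears on its own: every later compaction attempt sees the same stale C1.inflight, which is exactly why the reviewer asks whether manual cleanup (or automatic rollback of failed commits) is required for compaction to make progress.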
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

