nbalajee commented on a change in pull request #3426:
URL: https://github.com/apache/hudi/pull/3426#discussion_r686393903
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##########
@@ -157,6 +147,37 @@ protected void commit(List<HoodieRecord> records, String partitionName, String i
.lastInstant().map(HoodieInstant::getTimestamp);
}
+  /**
+   * Perform a compaction on the Metadata Table.
+   *
+   * We cannot perform compaction if there are previous inflight operations on the dataset. This is because
+   * a compacted metadata base file at time Tx should represent all the actions on the dataset till time Tx.
+   */
+  private void compactIfNecessary(SparkRDDWriteClient writeClient, String instantTime) {
+    List<HoodieInstant> pendingInstants = datasetMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
+        .findInstantsBefore(instantTime).getInstants().collect(Collectors.toList());
+    if (!pendingInstants.isEmpty()) {
+ if (!pendingInstants.isEmpty()) {
Review comment:
If the dataset timeline has an incomplete commit (for example, parallel commits C1 and C2 were started, C2 succeeded but C1 failed, leaving C1.inflight behind), dataset commits and delta commits on the metadata table will still succeed, but compaction of the metadata table could fall behind indefinitely.
With the current approach, would manual intervention be required to clean up the inflight instant so that compaction can make progress, or would ingestion/dataset commits start failing once the maxArchivalLimit on the metadata table is reached (because delta commits keep being created but are never compacted)?
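To make the scenario concrete, here is a minimal, self-contained sketch of the pending-instant check using plain collections rather than the Hudi timeline API (the `Instant` record and `canCompact` helper are hypothetical stand-ins, not Hudi classes). It shows how a single orphaned inflight instant keeps every later compaction attempt blocked until it is cleaned up:

```java
import java.util.List;
import java.util.stream.Collectors;

public class InflightBlocksCompaction {
    // Hypothetical stand-in for a timeline instant: a timestamp plus completion state.
    record Instant(String timestamp, boolean completed) {}

    // Mirrors the PR's logic: any incomplete instant with a timestamp earlier
    // than the compaction time means compaction must be skipped.
    static boolean canCompact(List<Instant> timeline, String compactionTime) {
        List<Instant> pending = timeline.stream()
            .filter(i -> !i.completed())
            .filter(i -> i.timestamp().compareTo(compactionTime) < 0)
            .collect(Collectors.toList());
        return pending.isEmpty();
    }

    public static void main(String[] args) {
        // C1 failed and was left inflight; C2 and a later commit completed.
        List<Instant> timeline = List.of(
            new Instant("001", false),  // C1.inflight (writer crashed)
            new Instant("002", true),   // C2 succeeded
            new Instant("003", true));  // later ingestion commit

        // Every compaction attempted after C1 stays blocked until C1 is cleaned up.
        System.out.println(canCompact(timeline, "004")); // false: C1 still inflight
    }
}
```

Under this check, delta commits on the metadata table keep accumulating while compaction is skipped, which is exactly the archival-pressure concern raised above.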
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]