voonhous opened a new issue, #6716:
URL: https://github.com/apache/hudi/issues/6716
Hello Hudi, this is a question about the design considerations linking the
metadata table (MDT) to the commit archival action on a data table (DT).
Before commits on the DT can be archived, at least one compaction must have
been performed on the MDT.
```java
// If metadata table is enabled, do not archive instants which are more recent than the last
// compaction on the metadata table.
if (config.isMetadataTableEnabled()) {
  try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(
      table.getContext(), config.getMetadataConfig(), config.getBasePath(),
      FileSystemViewStorageConfig.SPILLABLE_DIR.defaultValue())) {
    Option<String> latestCompactionTime = tableMetadata.getLatestCompactionTime();
    if (!latestCompactionTime.isPresent()) {
      LOG.info("Not archiving as there is no compaction yet on the metadata table");
      instants = Stream.empty();
    } else {
      LOG.info("Limiting archiving of instants to latest compaction on metadata table at "
          + latestCompactionTime.get());
      instants = instants.filter(instant ->
          HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.LESSER_THAN,
              latestCompactionTime.get()));
    }
  } catch (Exception e) {
    throw new HoodieException("Error limiting instant archival based on metadata table", e);
  }
}
```
Assume a DT has the MDT enabled (the default for Spark entrypoints) and
ONLY **INSERT-OVERWRITE** actions are performed on the DT (a table service
action generating `replacecommit`s). In that case, commits are never archived,
because compaction on the MDT is never performed when a table service
action is performed on the DT.
As such, the archival service on the DT depends on the MDT's compaction
service, which in turn depends on the DT's data manipulation operations.
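To make the dependency concrete, here is a minimal, self-contained sketch of the gating logic quoted above. It is not Hudi code: the class `ArchivalGateSketch`, the method `archivableInstants`, and the sample timestamps are all hypothetical, and it assumes instant times are fixed-width strings, so that plain string comparison matches time order.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class ArchivalGateSketch {

  // Hypothetical stand-in for the guard in the quoted snippet: keep only the
  // instants strictly older than the latest MDT compaction, or nothing at all
  // when the MDT has never been compacted.
  public static List<String> archivableInstants(List<String> candidateInstants,
                                                Optional<String> latestMdtCompactionTime) {
    if (!latestMdtCompactionTime.isPresent()) {
      // No compaction on the MDT yet, so nothing on the DT may be archived.
      return List.of();
    }
    String cutoff = latestMdtCompactionTime.get();
    return candidateInstants.stream()
        // Assumption: fixed-width timestamp strings, so lexicographic order is time order.
        .filter(ts -> ts.compareTo(cutoff) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Made-up replacecommit instant times on the DT.
    List<String> replaceCommits =
        List.of("20220901100000", "20220902100000", "20220903100000");

    // MDT never compacted: the candidate list is emptied.
    System.out.println(archivableInstants(replaceCommits, Optional.empty()));

    // MDT compacted at some point: only strictly older instants remain.
    System.out.println(archivableInstants(replaceCommits, Optional.of("20220903000000")));
  }
}
```

In the first call, which corresponds to the INSERT-OVERWRITE-only scenario described above, no instant is eligible for archival, however many `replacecommit`s accumulate on the DT.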
TLDR: I am unsure what design considerations led to this restriction, so I am
asking the community why this is the case.
Thank you.
**Environment Description**
* Hudi version : 0.11.1
* Spark version : 3.1
* Running on Docker? (yes/no) : no