voonhous opened a new issue, #6716:
URL: https://github.com/apache/hudi/issues/6716

   Hello Hudi, this is a question about the design considerations linking the metadata table (MDT) and the commit-archival action on a data table (DT).
   
   When archiving commits on the DT, at least one compaction must already have been performed on the MDT:
   
   ```java
       // If metadata table is enabled, do not archive instants which are more recent than the last compaction on the
       // metadata table.
       if (config.isMetadataTableEnabled()) {
         try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(table.getContext(), config.getMetadataConfig(),
             config.getBasePath(), FileSystemViewStorageConfig.SPILLABLE_DIR.defaultValue())) {
           Option<String> latestCompactionTime = tableMetadata.getLatestCompactionTime();
           if (!latestCompactionTime.isPresent()) {
             LOG.info("Not archiving as there is no compaction yet on the metadata table");
             instants = Stream.empty();
           } else {
             LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get());
             instants = instants.filter(instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.LESSER_THAN,
                 latestCompactionTime.get()));
           }
         } catch (Exception e) {
           throw new HoodieException("Error limiting instant archival based on metadata table", e);
         }
       }
   ```
   
   Assume a DT with the MDT enabled (the default for Spark entrypoints) on which ONLY **INSERT-OVERWRITE** actions are performed (a table service action generating `replacecommit`s). In that case, archival of commits will never happen.
   
   This is because compaction on the MDT is never triggered when only table service actions are performed on the DT.
   
   As such, the DT's archival service depends on the MDT's compaction service, which in turn depends on the DT's data manipulation operations.
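   
   To make the gating behavior concrete, here is a minimal, self-contained sketch of the filtering shown above (plain Java, not Hudi's actual classes; `archivableInstants` and the sample timestamps are hypothetical): when the MDT has no compaction instant yet, every candidate instant is withheld from archival, and otherwise only instants strictly older than the latest MDT compaction time remain eligible.
   
   ```java
   import java.util.List;
   import java.util.Optional;
   import java.util.stream.Collectors;
   import java.util.stream.Stream;
   
   public class ArchivalGateSketch {
   
       // Mirrors the gating logic: only instants strictly older than the latest
       // MDT compaction time are eligible for archival; with no compaction yet,
       // nothing is eligible.
       static List<String> archivableInstants(Stream<String> candidates,
                                              Optional<String> latestCompactionTime) {
           if (!latestCompactionTime.isPresent()) {
               return List.of(); // no compaction on the MDT => archive nothing
           }
           String cutoff = latestCompactionTime.get();
           // Lexicographic comparison is sufficient for fixed-width instant timestamps.
           return candidates
               .filter(ts -> ts.compareTo(cutoff) < 0)
               .collect(Collectors.toList());
       }
   
       public static void main(String[] args) {
           // Only replacecommits on the DT => MDT never compacts => empty Optional.
           System.out.println(archivableInstants(
               Stream.of("20220901000000", "20220902000000", "20220903000000"),
               Optional.empty()));                       // prints []
           // Once a compaction exists, older instants become archivable.
           System.out.println(archivableInstants(
               Stream.of("20220901000000", "20220902000000", "20220903000000"),
               Optional.of("20220902000000")));          // prints [20220901000000]
       }
   }
   ```
   
   With only `replacecommit`s on the DT, the `Optional` never becomes present, so the eligible set stays empty indefinitely.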
   
   TL;DR: I am unsure what design considerations led to this restriction, so I am consulting the community on why this is the case.
   
   Thank you.
   
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.1
   
   * Running on Docker? (yes/no) : no
   
   
   

