nsivabalan commented on a change in pull request #3590:
URL: https://github.com/apache/hudi/pull/3590#discussion_r716251124
##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##########
@@ -401,64 +394,83 @@ private boolean bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
 }
 /**
- * Sync the Metadata Table from the instants created on the dataset.
+ * Initialize file groups for a partition. For file listing, we just have one file group.
  *
- * @param datasetMetaClient {@code HoodieTableMetaClient} for the dataset
+ * All file groups for a given metadata partition have a fixed prefix, as per {@link MetadataPartitionType#getFileIdPrefix()}.
+ * Each file group is suffixed with increments of 1, starting with 1.
+ *
+ * For instance, for FILES there is only one file group, named "files-1".
+ * Let's say we configure 10 file groups for the record-level index, with the prefix "record-index-bucket-".
+ * File groups will then be named:
+ *   record-index-bucket-01
+ *   record-index-bucket-02
+ *   ...
+ *   record-index-bucket-10
  */
-  private void syncFromInstants(HoodieTableMetaClient datasetMetaClient) {
-    ValidationUtils.checkState(enabled, "Metadata table cannot be synced as it is not enabled");
-    // (re) init the metadata for reading.
-    initTableMetadata();
-    try {
-      List<HoodieInstant> instantsToSync = metadata.findInstantsToSyncForWriter();
-      if (instantsToSync.isEmpty()) {
-        return;
-      }
-
-      LOG.info("Syncing " + instantsToSync.size() + " instants to metadata table: " + instantsToSync);
-
-      // Read each instant in order and sync it to metadata table
-      for (HoodieInstant instant : instantsToSync) {
-        LOG.info("Syncing instant " + instant + " to metadata table");
-
-        Option<List<HoodieRecord>> records = HoodieTableMetadataUtil.convertInstantToMetaRecords(datasetMetaClient,
-            metaClient.getActiveTimeline(), instant, metadata.getUpdateTime());
-        if (records.isPresent()) {
-          commit(records.get(), MetadataPartitionType.FILES.partitionPath(), instant.getTimestamp());
-        }
+  private void initializeFileGroups(HoodieTableMetaClient datasetMetaClient, MetadataPartitionType metadataPartition, String instantTime,
+      int fileGroupCount) throws IOException {
+
+    final HashMap<HeaderMetadataType, String> blockHeader = new HashMap<>();
+    blockHeader.put(HeaderMetadataType.INSTANT_TIME, instantTime);
+    // Archival of the data table has a dependency on compaction (base files) in the metadata table.
+    // It is assumed that, as of the time Tx of the base instant (/compaction time) in the metadata table,
+    // all commits in the data table are in sync with the metadata table. So we always start with a log file for any file group.
+    final HoodieDeleteBlock block = new HoodieDeleteBlock(new HoodieKey[0], blockHeader);
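The naming scheme described in the Javadoc above can be sketched as follows. This is a minimal, hypothetical helper to illustrate the fixed-prefix-plus-counter convention; the two-digit zero padding and the standalone class are assumptions for illustration, not the actual Hudi implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class FileGroupNaming {

  // Hypothetical helper: build file group IDs from a fixed per-partition prefix
  // plus a 1-based counter, zero-padded to two digits as in the Javadoc example.
  static List<String> fileGroupIds(String fileIdPrefix, int fileGroupCount) {
    List<String> ids = new ArrayList<>();
    for (int i = 1; i <= fileGroupCount; i++) {
      ids.add(String.format("%s%02d", fileIdPrefix, i));
    }
    return ids;
  }

  public static void main(String[] args) {
    // 10 buckets for a record-level index:
    // record-index-bucket-01 ... record-index-bucket-10
    System.out.println(fileGroupIds("record-index-bucket-", 10));
  }
}
```

(For the FILES partition the Javadoc shows a single, unpadded group, "files-1", so the padding above applies only to the multi-bucket example.)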
Review comment:
I feel it may not be easy to relax this; we can discuss it async as we close out this patch.
Here are the two dependencies, of which one could be relaxed:
1. During rollback, we check whether the commit being rolled back has already been synced. If it is earlier than the last compacted time, we assume it is already synced. We could relax this if need be: we could always assume that a commit which is not part of the metadata table's active timeline has not been synced, and go ahead with the rollback. The only difference is that some additional files might be added to the delete list which were never synced to the metadata table at all.
2. Archival of the dataset is dependent on compaction in the metadata table. This might need more thought.
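Point 1 above could be sketched roughly as follows. This is an illustration only, with hypothetical method and parameter names (not the actual Hudi code); it relies on the fact that Hudi instant timestamps are ordered strings, so lexicographic comparison orders them in time:

```java
import java.util.Set;

public class RollbackSyncCheck {

  // Current dependency (point 1): a commit earlier than the last compacted time
  // in the metadata table is assumed to already be synced; otherwise it counts
  // as synced only if it appears in the metadata table's active timeline.
  static boolean isSyncedCurrent(String commitTime, Set<String> metadataActiveInstants,
                                 String lastCompactedTime) {
    return commitTime.compareTo(lastCompactedTime) < 0
        || metadataActiveInstants.contains(commitTime);
  }

  // Relaxed variant: only commits present in the metadata table's active timeline
  // are considered synced; everything else is rolled back unconditionally. The
  // cost is that the delete list may include files never synced to the metadata table.
  static boolean isSyncedRelaxed(String commitTime, Set<String> metadataActiveInstants) {
    return metadataActiveInstants.contains(commitTime);
  }
}
```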
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]