danny0405 commented on code in PR #9057:
URL: https://github.com/apache/hudi/pull/9057#discussion_r1243067339


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -344,6 +344,13 @@ private boolean initializeFromFilesystem(String initializationTime, List<Metadat
     if (!filesPartitionAvailable) {
       partitionsToInit.remove(MetadataPartitionType.FILES);
       partitionsToInit.add(0, MetadataPartitionType.FILES);
+      // By default we allocate 10 file groups in the RLI, but for very large tables these need to be configured properly.
+      // For an existing table, we dynamically deduce the number of file groups for the RLI; but for a fresh table, since
+      // there are no records, we might choose the default value of 10.
+      // So, deferring the instantiation of the RLI until at least 1 completed commit in the DT.
+      if (partitionsToInit.contains(MetadataPartitionType.RECORD_INDEX) &&
+          dataMetaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants().countInstants() == 0) {
+        partitionsToInit.remove(MetadataPartitionType.RECORD_INDEX);
+      }

Review Comment:
   I'm wondering whether the estimation of the file group count works well in production. It only takes the existing record count for the file group estimation; even though there is a growth factor that the user can configure explicitly, it is still somewhat fixed (not dynamic). In production, the record NDV of most tables increases continuously, so the estimation may never work as expected, especially for a fresh new table.
   
   I would prefer the initial 10 for a fresh new table, because the estimation result from the 1st commit is very probably smaller than 10, but as time goes by, we need far more file groups.
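   To make the concern concrete, here is a minimal sketch of the kind of estimation being discussed. This is **not** Hudi's actual implementation; the class name, the records-per-file-group target, and the growth-factor handling are all illustrative assumptions. It shows how a one-shot estimate taken from the first commit's record count can undershoot the default of 10, while a genuinely large table would need far more file groups:
   
   ```java
   // Hypothetical sketch of record-index file group estimation (illustrative
   // names and constants, not Hudi's real code).
   public class RecordIndexFileGroupEstimator {
   
     // Assumed default, mirroring the discussion: 10 file groups for a fresh table.
     static final int DEFAULT_FILE_GROUP_COUNT = 10;
   
     // Assumed target number of record keys per file group.
     static final long RECORDS_PER_FILE_GROUP = 10_000_000L;
   
     static int estimateFileGroupCount(long existingRecordCount, double growthFactor) {
       if (existingRecordCount <= 0) {
         // Fresh table: nothing to base an estimate on, fall back to the default.
         return DEFAULT_FILE_GROUP_COUNT;
       }
       // The growth factor is a fixed multiplier, so if record NDV keeps growing
       // over time, this one-shot projection can stay permanently too low.
       long projected = (long) (existingRecordCount * growthFactor);
       int estimated = (int) Math.ceil((double) projected / RECORDS_PER_FILE_GROUP);
       // Floor at the default, since an estimate from a small first commit
       // would otherwise come out below 10.
       return Math.max(estimated, DEFAULT_FILE_GROUP_COUNT);
     }
   
     public static void main(String[] args) {
       System.out.println(estimateFileGroupCount(0L, 2.0));           // fresh table
       System.out.println(estimateFileGroupCount(5_000_000L, 2.0));   // small 1st commit
       System.out.println(estimateFileGroupCount(500_000_000L, 2.0)); // large table
     }
   }
   ```
   
   With a small first commit (5M records, growth factor 2.0) the raw estimate is 1, below the default of 10, which is exactly why starting from the default and letting the count be revised later may be safer than deferring initialization.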



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
