[GitHub] [hudi] prashantwason commented on a diff in pull request #8758: [HUDI-53] Implementation of record_index - a HUDI index based on the metadata table.

via GitHub Tue, 13 Jun 2023 16:08:37 -0700


prashantwason commented on code in PR #8758:
URL: https://github.com/apache/hudi/pull/8758#discussion_r1228636738



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -111,18 +111,27 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta
 
   public static final String METADATA_COMPACTION_TIME_SUFFIX = "001";
 
+  // Virtual keys support for metadata table. This Field is
+  // from the metadata payload schema.
+  private static final String RECORD_KEY_FIELD_NAME = HoodieMetadataPayload.KEY_FIELD_NAME;
+
+  // Average size of a record saved within the record index.
+  // Record index has a fixed size schema. This has been calculated based on experiments with default settings
+  // for block size (4MB), compression (GZ) and disabling the hudi metadata fields.

Review Comment:
   Our code also shows a 1MB block size. I don't know whether we actually used 4MB when measuring this average record size. Let's revisit this after some more testing.
   If the size is not set correctly, it will only make the estimated number of file groups slightly inaccurate. It should not cause other perf issues.
   
   Assuming a 48-byte record size (compressed), a 1MB block holds about 1MB/48 = 21,845 mappings.
   
   With 1MB block size:
     A dataset with 10 file groups and 100M records will have about 480 blocks in each HFile (10M records x 48 bytes = 480MB per file group).
   
   With 4MB block size:
     A dataset with 10 file groups and 100M records will have about 120 blocks in each HFile.
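
The arithmetic above can be reproduced with a small sketch. The class and method names here are hypothetical and not part of the PR; the sketch assumes the 48-byte compressed record size from this comment, records spread evenly across file groups, and one HFile per file group.

```java
public class RecordIndexBlockEstimate {
    // Assumed average compressed size of one record index entry, in bytes
    // (the figure used in this review comment, not a measured constant).
    static final long RECORD_SIZE_BYTES = 48;

    /** Number of key-to-location mappings that fit in one block of the given size. */
    static long mappingsPerBlock(long blockSizeBytes) {
        return blockSizeBytes / RECORD_SIZE_BYTES;
    }

    /**
     * Approximate number of blocks in each HFile, assuming records are spread
     * evenly across file groups and each file group is stored in one HFile.
     */
    static long blocksPerFile(long totalRecords, int fileGroups, long blockSizeBytes) {
        long recordsPerFileGroup = totalRecords / fileGroups;
        long bytesPerFileGroup = recordsPerFileGroup * RECORD_SIZE_BYTES;
        // Ceiling division: a partially filled last block still counts as a block.
        return (bytesPerFileGroup + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L; // binary megabyte, used for the 21,845 figure
        long mb = 1_000_000L;     // decimal megabyte, used for the 480 / 120 figures

        System.out.println("mappings per 1MB block: " + mappingsPerBlock(mib));
        System.out.println("blocks per HFile at 1MB: " + blocksPerFile(100_000_000L, 10, mb));
        System.out.println("blocks per HFile at 4MB: " + blocksPerFile(100_000_000L, 10, 4 * mb));
    }
}
```

Note that the 21,845 figure uses a binary megabyte while 480 and 120 correspond to decimal megabytes; with binary megabytes the block counts come out slightly lower (about 458 and 115), which does not change the conclusion.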
   
   So with a 4MB block size, looking up 100+ keys may end up loading nearly the entire HFile, since each key can land in a different one of the ~120 blocks.
   
   A large HFile block size is not good for RI lookups, since keys may be spread throughout the file and each lookup has to load the whole block that contains the key. But smaller block sizes may reduce the compression ratio. This is workload dependent, so it is hard to test.
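
To get a rough feel for how block count affects lookups, here is a minimal sketch under a simplifying assumption that is not from the PR: each looked-up key falls into a uniformly random block, independently of the others. The class and method names are hypothetical, and the model ignores block caching and any clustering of keys.

```java
public class BlockTouchEstimate {
    /**
     * Expected fraction of blocks that must be read when `keys` lookups each
     * land in a uniformly random block out of `blocks` (independent lookups).
     * A given block is missed by every lookup with probability (1 - 1/blocks)^keys.
     */
    static double expectedFractionLoaded(int blocks, int keys) {
        double missProbability = Math.pow(1.0 - 1.0 / blocks, keys);
        return 1.0 - missProbability;
    }

    public static void main(String[] args) {
        // Block counts taken from the 1MB and 4MB cases discussed in this comment.
        for (int blocks : new int[] {480, 120}) {
            System.out.printf("blocks=%d keys=100 -> expected fraction loaded %.2f%n",
                    blocks, expectedFractionLoaded(blocks, 100));
        }
    }
}
```

Under this model, 100 lookups read a bit over half of a 120-block file on average but under a fifth of a 480-block file. The worst case is still one block per key, which for 120 or more keys against 120 blocks is the whole file.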



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

