yihua commented on code in PR #13977:
URL: https://github.com/apache/hudi/pull/13977#discussion_r2373568824
##########
hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileFileInfoBlock.java:
##########
@@ -80,15 +80,26 @@ public HFileInfo readFileInfo() throws IOException {
public void add(String name, byte[] value) {
fileInfoToWrite.put(name, value);
+ // For file info, all entries are put into the same map,
+ // so we sum their sizes.
+ longestEntrySize += getUTF8Bytes(name).length + value.length;
}
public boolean containsKey(String name) {
return fileInfoToWrite.containsKey(name);
}
+ @Override
+ protected int calculateBufferCapacity() {
+ // Based on an internet search, the overhead of a BytesBytesPair is about
+ // 30 bytes, plus 4 bytes for the magic number.
+ // To be safe, we use an extra 50 bytes per entry here.
+ return longestEntrySize + 50 * fileInfoToWrite.size();
+ }
+
@Override
public ByteBuffer getUncompressedBlockDataToWrite() {
- ByteBuffer buff = ByteBuffer.allocate(context.getBlockSize() * 2);
+ ByteBuffer buff = ByteBuffer.allocate(calculateBufferCapacity());
Review Comment:
Then we can simplify the block write logic by simply writing the key-value
entries. One factor to control is that, if the estimate is accurate, the byte
array should not need to grow in most cases, which avoids copy overhead.
How does HBase HFile writer manage the byte array or buffer for writes?
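To illustrate the estimation being discussed, here is a standalone sketch (with hypothetical class and field names such as `FileInfoBufferSketch` and `totalEntrySize`, not Hudi's actual `HFileFileInfoBlock` or its serialization format) of pre-sizing a `ByteBuffer` from the summed entry sizes plus a fixed per-entry overhead, so that writing the key-value entries never triggers a buffer reallocation or array copy:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: track the total serialized size as entries are added,
// then allocate one buffer large enough for all entries plus a conservative
// fixed overhead per entry (50 bytes, mirroring the constant in the diff).
public class FileInfoBufferSketch {
  private static final int PER_ENTRY_OVERHEAD = 50;

  private final Map<String, byte[]> entries = new LinkedHashMap<>();
  private int totalEntrySize = 0; // summed key + value bytes across all entries

  public void add(String name, byte[] value) {
    entries.put(name, value);
    totalEntrySize += name.getBytes(StandardCharsets.UTF_8).length + value.length;
  }

  public int capacity() {
    return totalEntrySize + PER_ENTRY_OVERHEAD * entries.size();
  }

  // Serialize all entries (length-prefixed, a made-up layout for illustration)
  // into one pre-sized buffer: no growth, no intermediate array copies.
  public ByteBuffer serialize() {
    ByteBuffer buff = ByteBuffer.allocate(capacity());
    for (Map.Entry<String, byte[]> e : entries.entrySet()) {
      byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
      buff.putInt(key.length).put(key);
      buff.putInt(e.getValue().length).put(e.getValue());
    }
    buff.flip();
    return buff;
  }

  public static void main(String[] args) {
    FileInfoBufferSketch b = new FileInfoBufferSketch();
    b.add("hfile.LASTKEY", new byte[] {1, 2, 3});
    b.add("hfile.AVG_KEY_LEN", new byte[] {4});
    System.out.println("capacity=" + b.capacity()
        + " written=" + b.serialize().remaining());
  }
}
```

Since the capacity is an upper bound rather than an exact size, the caller still needs `flip()` (or `position()`) to find the actual number of bytes written, as the diff's `getUncompressedBlockDataToWrite()` presumably does.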
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]