yihua commented on code in PR #13977:
URL: https://github.com/apache/hudi/pull/13977#discussion_r2373564076
##########
hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileFileInfoBlock.java:
##########
@@ -80,15 +80,26 @@ public HFileInfo readFileInfo() throws IOException {
public void add(String name, byte[] value) {
fileInfoToWrite.put(name, value);
+ // For file info, all entries are put into the same map,
+ // so we sum their sizes.
+ longestEntrySize += getUTF8Bytes(name).length + value.length;
}
public boolean containsKey(String name) {
return fileInfoToWrite.containsKey(name);
}
+ @Override
+ protected int calculateBufferCapacity() {
+ // Based on informal research, the overhead of a BytesBytesPair is about 30 bytes,
+ // plus 4 bytes for the magic number.
+ // To be safe, we use an extra 50 bytes per entry here.
+ return longestEntrySize + 50 * fileInfoToWrite.size();
+ }
+
@Override
public ByteBuffer getUncompressedBlockDataToWrite() {
- ByteBuffer buff = ByteBuffer.allocate(context.getBlockSize() * 2);
+ ByteBuffer buff = ByteBuffer.allocate(calculateBufferCapacity());
Review Comment:
Should we use a `ByteArrayOutputStream`, which dynamically extends its byte
array when the data to write exceeds the current capacity, instead of a
fixed-capacity `ByteBuffer` that risks a `BufferOverflowException`? That
would give better error handling if the size estimate is ever too small.
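To illustrate the suggestion, here is a minimal, hypothetical sketch (class and field names are illustrative, not the actual `HFileFileInfoBlock` code, and the length-prefixed encoding below is simplified, not the real HFile wire format). Because `ByteArrayOutputStream` grows on demand, no capacity estimate is needed and an undersized guess cannot overflow:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: serialize file-info entries without a fixed-size buffer.
public class FileInfoBlockSketch {
  private final Map<String, byte[]> fileInfoToWrite = new TreeMap<>();

  public void add(String name, byte[] value) {
    fileInfoToWrite.put(name, value);
  }

  public ByteBuffer getUncompressedBlockDataToWrite() throws IOException {
    // The stream's backing array doubles as needed, so there is no
    // BufferOverflowException to worry about.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(baos);
    for (Map.Entry<String, byte[]> entry : fileInfoToWrite.entrySet()) {
      byte[] key = entry.getKey().getBytes(StandardCharsets.UTF_8);
      out.writeInt(key.length);          // simplified length-prefixed encoding,
      out.write(key);                    // not the actual HFile format
      out.writeInt(entry.getValue().length);
      out.write(entry.getValue());
    }
    out.flush();
    return ByteBuffer.wrap(baos.toByteArray());
  }
}
```

The trade-off is one extra array copy in `toByteArray()`; for a file-info block, which is small and written once per file, that cost is negligible compared to the safety gained.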
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]