vinishjail97 opened a new issue, #14077:
URL: https://github.com/apache/hudi/issues/14077

   ### Bug Description
   
   **What happened:**
  Out of Memory (OOM) errors occur when building a Secondary Index (SI) on large tables. The error manifests during metadata table write operations (see the stack trace below).
   
   
   **What you expected:**
   Can we avoid populating in-memory hash maps and lists and instead return an iterator directly, so records are streamed to the writer and memory pressure is avoided? A sketch of the idea follows the link below.
   
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java#L200
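
   The following is a minimal, hypothetical sketch of that streaming idea. `StreamingRecordIterator`, the record type parameter `R`, and the per-file `readFile` function are illustrative names only (not Hudi's actual API); the point is simply that records are pulled file-by-file instead of being collected into a list up front.

   ```java
   import java.util.Collections;
   import java.util.Iterator;
   import java.util.List;
   import java.util.NoSuchElementException;
   import java.util.function.Function;

   // Hypothetical sketch: expose secondary-index records as a lazy iterator
   // instead of materializing every record in an in-memory list/map first.
   public final class StreamingRecordIterator<R> implements Iterator<R> {
     private final Iterator<String> files;                  // file paths still to read
     private final Function<String, Iterator<R>> readFile;  // opens one file lazily
     private Iterator<R> current = Collections.emptyIterator();

     public StreamingRecordIterator(List<String> filePaths,
                                    Function<String, Iterator<R>> readFile) {
       this.files = filePaths.iterator();
       this.readFile = readFile;
     }

     @Override
     public boolean hasNext() {
       // Advance to the next non-empty file; only one file's reader is active at a time.
       while (!current.hasNext() && files.hasNext()) {
         current = readFile.apply(files.next());
       }
       return current.hasNext();
     }

     @Override
     public R next() {
       if (!hasNext()) {
         throw new NoSuchElementException();
       }
       return current.next();
     }
   }
   ```

   With an iterator like this, peak heap usage is roughly bounded by one file's worth of records plus the downstream write buffers, rather than the full record set.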
 
   
   **Steps to reproduce:**
   Build a secondary index on a ~100GB table whose Parquet files are 100MB+ each.
   
   ### Environment
   
     - Hudi Version: 1.0.0+ (any version with Secondary Index support)
     - Spark Version: 3.5.x
     - Table Type: MOR or COW
     - Table Size: 10M+ records, 100+ files
  - Heap Size: Standard executor memory (insufficient for the current non-streaming approach)
   
   
   
   ### Logs and Stack Trace
   
      java.lang.OutOfMemoryError: Java heap space
          at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
          at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)
          at org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:96)
          at org.apache.hudi.avro.HoodieAvroUtils.indexedRecordToBytesStream(HoodieAvroUtils.java:152)
          at org.apache.hudi.common.util.HFileUtils.serializeRecordsToLogBlock(HFileUtils.java:221)
          at org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:501)
          at org.apache.hudi.io.HoodieAppendHandle.flushToDiskIfRequired(HoodieAppendHandle.java:681)

