Moran created HUDI-9331:
---------------------------

             Summary: Incorrect memory estimation for CDC block flushing can 
lead to OOM
                 Key: HUDI-9331
                 URL: https://issues.apache.org/jira/browse/HUDI-9331
             Project: Apache Hudi
          Issue Type: Bug
          Components: cdc
            Reporter: Moran


When writing a CDC (Change Data Capture) log, Hudi accumulates records in memory 
until the total estimated size reaches {{maxBlockSize}}. The flush condition is 
based on the formula:
{code:java}
numOfCDCRecordsInMemory.get() * averageCDCRecordSize >= maxBlockSize {code}
However, {{averageCDCRecordSize}} is estimated only once, during the first write 
of CDC data:
{code:java}
if (cdcData.isEmpty()) {
    averageCDCRecordSize = sizeEstimator.sizeEstimate(payload);
} {code}
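The effect of this one-shot estimate can be reproduced with a minimal simulation (hypothetical class and parameter names, not actual Hudi code): the first record is 100 bytes, every later record is 10 KB, and the block size limit is 1 MB.
{code:java}
public class OneShotEstimateDemo {

    // Simulates the flush loop: the average is fixed from the first record,
    // while actualBytes tracks what is really held in memory.
    // Returns { recordsBuffered, actualBytesAtFlush }.
    static long[] simulate(long maxBlockSize, long firstRecordSize, long laterRecordSize) {
        long averageCDCRecordSize = 0;
        long numRecords = 0;
        long actualBytes = 0;
        while (true) {
            long recordSize = (numRecords == 0) ? firstRecordSize : laterRecordSize;
            if (numRecords == 0) {
                averageCDCRecordSize = recordSize; // one-time estimate: 100 bytes
            }
            numRecords++;
            actualBytes += recordSize;
            if (numRecords * averageCDCRecordSize >= maxBlockSize) {
                return new long[] { numRecords, actualBytes };
            }
        }
    }

    public static void main(String[] args) {
        long[] r = simulate(1024 * 1024, 100, 10 * 1024);
        System.out.println("records buffered before flush: " + r[0]); // 10486
        System.out.println("actual bytes held in memory:   " + r[1]); // ~102 MiB
    }
}
{code}
With these numbers the flush fires only after ~10,486 records, by which point roughly 102 MiB sit in memory against a 1 MiB threshold.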
This approach can underestimate memory usage. For instance, if the first CDC 
record is relatively small but subsequent records are much larger, the estimated 
average size stays inaccurately low. As a result, the number of records held in 
memory can grow far beyond what would actually fit in {{maxBlockSize}}, 
potentially causing an OutOfMemoryError (OOM) before the flush is triggered.
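One possible mitigation, sketched below under assumed names (this is not the actual Hudi patch), is to track the running total of per-record size estimates instead of freezing the average at the first record; the flush condition then reflects every record seen so far.
{code:java}
public class RunningAverageEstimator {
    private long totalEstimatedBytes = 0;
    private long count = 0;

    // Hypothetical hook: called with the estimated size of each buffered record.
    public void record(long estimatedSize) {
        totalEstimatedBytes += estimatedSize;
        count++;
    }

    public long averageRecordSize() {
        return count == 0 ? 0 : totalEstimatedBytes / count;
    }

    // Equivalent to count * average >= maxBlockSize, but because the true
    // running total is kept, a small first record can no longer mask much
    // larger records that arrive later.
    public boolean shouldFlush(long maxBlockSize) {
        return totalEstimatedBytes >= maxBlockSize;
    }
}
{code}
In the earlier scenario (100-byte first record, 10 KB followers, 1 MiB limit), this estimator would trigger the flush after roughly a hundred records rather than ten thousand.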



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
