Moran created HUDI-9331:
---------------------------
Summary: Incorrect memory estimation for CDC block flushing can
lead to OOM
Key: HUDI-9331
URL: https://issues.apache.org/jira/browse/HUDI-9331
Project: Apache Hudi
Issue Type: Bug
Components: cdc
Reporter: Moran
When writing a CDC (Change Data Capture) log, Hudi accumulates records in memory
until the total estimated size reaches {{maxBlockSize}}. The flush condition is
based on the formula:
{code:java}
numOfCDCRecordsInMemory.get() * averageCDCRecordSize >= maxBlockSize {code}
However, {{averageCDCRecordSize}} is estimated only once, when the first CDC
record is written:
{code:java}
if (cdcData.isEmpty()) {
  averageCDCRecordSize = sizeEstimator.sizeEstimate(payload);
} {code}
This approach can underestimate memory usage. For instance, if the first CDC
record is relatively small but subsequent records are much larger, the estimated
average size stays inaccurately low. As a result, the number of buffered records
can grow far beyond what would actually fit in {{maxBlockSize}}, potentially
leading to an OutOfMemoryError (OOM) before the flush is triggered.
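The following is a minimal, self-contained sketch (not Hudi code; the record sizes, loop bound, and variable names are hypothetical) showing how freezing the average at the first record's size lets actual memory grow roughly two orders of magnitude past the intended threshold before the flush condition fires:

{code:java}
public class CdcFlushSketch {
  public static void main(String[] args) {
    long maxBlockSize = 1024 * 1024;   // hypothetical 1 MiB flush threshold
    long averageCDCRecordSize = 0;     // estimated once, as in the report
    long actualBytesInMemory = 0;
    long numRecords = 0;

    for (int i = 0; i < 20_000; i++) {
      // small first record, much larger subsequent records
      long recordSize = (i == 0) ? 100 : 10_000;
      if (numRecords == 0) {
        averageCDCRecordSize = recordSize; // frozen at 100 bytes
      }
      actualBytesInMemory += recordSize;
      numRecords++;
      if (numRecords * averageCDCRecordSize >= maxBlockSize) {
        break; // flush would trigger here
      }
    }
    // Estimated usage is ~1 MiB, but actual memory is ~100x larger.
    System.out.println("records buffered: " + numRecords);
    System.out.println("estimated bytes:  " + numRecords * averageCDCRecordSize);
    System.out.println("actual bytes:     " + actualBytesInMemory);
  }
}
{code}
With these numbers the flush check passes only after 10,486 records, by which point roughly 104 MB of records are actually buffered against an estimate of about 1 MB. Tracking a running average (or summing per-record estimates) instead of a one-time estimate would keep the two in step.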
--
This message was sent by Atlassian Jira
(v8.20.10#820010)