[
https://issues.apache.org/jira/browse/HUDI-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-9331:
---------------------------------
Labels: pull-request-available (was: )
> Incorrect memory estimation for CDC block flushing can lead to OOM
> ------------------------------------------------------------------
>
> Key: HUDI-9331
> URL: https://issues.apache.org/jira/browse/HUDI-9331
> Project: Apache Hudi
> Issue Type: Bug
> Components: cdc
> Reporter: Moran
> Priority: Major
> Labels: pull-request-available
>
> When writing a CDC (Change Data Capture) log, Hudi accumulates records in
> memory until the total estimated size reaches {{maxBlockSize}}. The flush
> condition is based on the formula:
> {code:java}
> numOfCDCRecordsInMemory.get() * averageCDCRecordSize >= maxBlockSize {code}
> However, {{averageCDCRecordSize}} is estimated only once, during the
> first write of CDC data:
> {code:java}
> if (cdcData.isEmpty()) {
>   averageCDCRecordSize = sizeEstimator.sizeEstimate(payload);
> } {code}
> This approach can underestimate memory usage. For instance, if the first
> CDC record is relatively small but subsequent records are much larger, the
> estimated average size stays inaccurately low. As a result, the number of
> records held in memory can grow far beyond what would actually fit in
> {{maxBlockSize}}, potentially causing an OutOfMemoryError (OOM) before the
> flush is triggered.
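One possible mitigation is to accumulate the estimated size of every record rather than relying on a one-time average, so the flush threshold tracks actual memory pressure. The sketch below is hypothetical (class and method names are illustrative, not Hudi's actual API) and assumes a per-record size estimate is available at append time:

```java
// Hypothetical sketch: replace the one-time averageCDCRecordSize estimate
// with a running total of per-record size estimates. The flush check
// `totalEstimatedBytes >= maxBlockSize` is then equivalent to
// `numRecords * runningAverageSize >= maxBlockSize`, but the average
// is refreshed with every record instead of being fixed at the first one.
class CdcBufferSketch {
  private long totalEstimatedBytes = 0; // sum of all per-record estimates
  private long numRecords = 0;
  private final long maxBlockSize;

  CdcBufferSketch(long maxBlockSize) {
    this.maxBlockSize = maxBlockSize;
  }

  /** Records one CDC record's estimated size; returns true when a flush is due. */
  boolean addAndCheckFlush(long estimatedRecordSize) {
    totalEstimatedBytes += estimatedRecordSize;
    numRecords++;
    return totalEstimatedBytes >= maxBlockSize;
  }

  /** Called after the in-memory block is flushed to the log. */
  void reset() {
    totalEstimatedBytes = 0;
    numRecords = 0;
  }
}
```

With this scheme, a small first record followed by large ones no longer delays the flush: each large record immediately raises the accumulated total, so the block is flushed as soon as the real estimated footprint reaches {{maxBlockSize}}.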
--
This message was sent by Atlassian Jira
(v8.20.10#820010)