[ 
https://issues.apache.org/jira/browse/HUDI-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-9331:
---------------------------------
    Labels: pull-request-available  (was: )

> Incorrect memory estimation for CDC block flushing can lead to OOM
> ------------------------------------------------------------------
>
>                 Key: HUDI-9331
>                 URL: https://issues.apache.org/jira/browse/HUDI-9331
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cdc
>            Reporter: Moran
>            Priority: Major
>              Labels: pull-request-available
>
> When writing CDC (Change Data Capture) log, Hudi accumulates records in 
> memory until the total estimated size reaches {{maxBlockSize}}. The flush 
> condition is based on the formula:
> {code:java}
> numOfCDCRecordsInMemory.get() * averageCDCRecordSize >= maxBlockSize {code}
> However, the value of {{averageCDCRecordSize}} is estimated only once, during 
> the first write of CDC data:
> {code:java}
> if (cdcData.isEmpty()) {
>     averageCDCRecordSize = sizeEstimator.sizeEstimate(payload);
> } {code}
> This approach can underestimate memory usage. For instance, if the first CDC 
> record is relatively small but subsequent records are much larger, the 
> estimated average size stays artificially low. As a result, the number of 
> records held in memory can grow far beyond what actually fits in 
> {{maxBlockSize}}, potentially causing an OutOfMemoryError (OOM) before the 
> flush is triggered.
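> One way to avoid the underestimation is to accumulate a per-record size 
> estimate instead of multiplying a record count by a one-off average. A 
> minimal sketch (hypothetical helper, not the actual Hudi fix; the class and 
> method names are invented for illustration):
> {code:java}
> // Hypothetical sketch: accumulate each record's own size estimate so the
> // flush check reflects the bytes actually buffered, not the first record's size.
> public class CdcBufferSizeTracker {
>     private long totalEstimatedBytes = 0L;
> 
>     // Called once per buffered CDC record with that record's size estimate.
>     public void onRecordAdded(long estimatedRecordBytes) {
>         totalEstimatedBytes += estimatedRecordBytes;
>     }
> 
>     // Flush once the accumulated estimate reaches the block limit.
>     public boolean shouldFlush(long maxBlockSize) {
>         return totalEstimatedBytes >= maxBlockSize;
>     }
> 
>     // Clear the running total after the block is flushed.
>     public void reset() {
>         totalEstimatedBytes = 0L;
>     }
> }
> {code}
> With this scheme a small first record no longer skews the estimate: a 100-byte 
> record followed by a 900-byte record triggers a flush at a 1000-byte limit.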



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
