Moran created HUDI-9199:
---------------------------

             Summary: OOM when querying MOR table due to redundant copy of 
deltalog SpillableMap into memory
                 Key: HUDI-9199
                 URL: https://issues.apache.org/jira/browse/HUDI-9199
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Moran


Querying a Merge-On-Read (MOR) Hudi table causes the Spark executor to fail 
with an Out-of-Memory (OOM) error. The root cause is in how deltalog files are 
handled on the read path. Specifically:
 # {*}Redundant Copy Operation{*}:
After deltalog entries are read into a {{SpillableMap}} (which is designed to 
spill data to disk when memory thresholds are exceeded), an unnecessary 
*in-memory copy* of the data is made. This defeats the purpose of 
{{SpillableMap}}: every entry ends up resident in memory instead of being 
spilled to disk as intended.

 # {*}Key Code Path{*}:
https://github.com/apache/hudi/blob/8b7bd1391bd61600d23b8defed4cdc7d789502d1/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/Iterators.scala#L416
 # {*}Impact{*}:
For large deltalog files or tables with frequent updates, this redundant copy 
operation leads to excessive memory consumption, ultimately causing OOM errors 
and query failures.

 # {*}Expected Behavior{*}:
The {{SpillableMap}} should manage memory/disk spilling transparently, and no 
unnecessary in-memory copies should occur during the deltalog processing phase.
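
To illustrate the redundant copy described above, here is a minimal Java sketch. It does not reproduce the Hudi code at the linked line. {{ToySpillableMap}} and {{SpillCopyDemo}} are hypothetical names invented for this example, and the "disk" tier is only simulated with a second map so that the example stays self-contained. The point is the access pattern: copying a spillable map into a plain {{HashMap}} makes every entry heap-resident, whereas iterating over it retains nothing beyond the map's own memory budget.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical demo class; compares an eager copy with lazy iteration. */
final class SpillCopyDemo {
    /** Builds a toy map holding `records` entries with room for `heapBudget` on heap. */
    static ToySpillableMap load(int records, int heapBudget) {
        ToySpillableMap map = new ToySpillableMap(heapBudget);
        for (int i = 0; i < records; i++) {
            map.put("key-" + i, "payload-" + i);
        }
        return map;
    }

    /** Anti-pattern: copies every entry, spilled or not, into a plain heap map. */
    static int heapEntriesWithEagerCopy(ToySpillableMap source) {
        Map<String, String> copy = new HashMap<>();
        for (Map.Entry<String, String> e : source) {
            copy.put(e.getKey(), e.getValue());
        }
        return copy.size(); // number of entries now pinned on the heap by the copy
    }

    /** Preferred pattern: visits entries one at a time and retains none of them. */
    static int visitLazily(ToySpillableMap source) {
        int visited = 0;
        for (Map.Entry<String, String> e : source) {
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        ToySpillableMap logRecords = load(1000, 10);
        System.out.println("heap tier entries: " + logRecords.heapEntries());
        System.out.println("entries held by eager copy: " + heapEntriesWithEagerCopy(logRecords));
        System.out.println("entries visited lazily: " + visitLazily(logRecords));
    }
}

/** Toy stand-in for a spillable map. Assumption for illustration; not Hudi's class. */
final class ToySpillableMap implements Iterable<Map.Entry<String, String>> {
    private final int heapBudget; // maximum number of entries kept in the heap tier
    private final Map<String, String> heapTier = new HashMap<>();
    private final Map<String, String> diskTier = new HashMap<>(); // simulates a spill file

    ToySpillableMap(int heapBudget) {
        this.heapBudget = heapBudget;
    }

    void put(String key, String value) {
        boolean updateInHeap = heapTier.containsKey(key);
        boolean newKeyFitsInHeap = !diskTier.containsKey(key) && heapTier.size() < heapBudget;
        if (updateInHeap || newKeyFitsInHeap) {
            heapTier.put(key, value);
        } else {
            diskTier.put(key, value); // budget exhausted: entry is "spilled"
        }
    }

    int heapEntries() {
        return heapTier.size();
    }

    int size() {
        return heapTier.size() + diskTier.size();
    }

    @Override
    public Iterator<Map.Entry<String, String>> iterator() {
        final Iterator<Map.Entry<String, String>> heapIt = heapTier.entrySet().iterator();
        final Iterator<Map.Entry<String, String>> diskIt = diskTier.entrySet().iterator();
        return new Iterator<Map.Entry<String, String>>() {
            @Override
            public boolean hasNext() {
                return heapIt.hasNext() || diskIt.hasNext();
            }

            @Override
            public Map.Entry<String, String> next() {
                return heapIt.hasNext() ? heapIt.next() : diskIt.next();
            }
        };
    }
}
```

With 1000 records and a budget of 10, the toy map keeps 10 entries in its heap tier, while the eager copy holds all 1000. Lazy iteration also visits all 1000 entries but leaves the heap tier at 10. A fix along these lines would presumably read from the {{SpillableMap}} directly rather than materializing it, though the actual change depends on the code at the linked line.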


