Moran created HUDI-9199:
---------------------------
Summary: OOM when querying MOR table due to redundant copy of
deltalog SpillableMap into memory
Key: HUDI-9199
URL: https://issues.apache.org/jira/browse/HUDI-9199
Project: Apache Hudi
Issue Type: Bug
Reporter: Moran
When querying a Merge-On-Read (MOR) Hudi table, the Spark executor fails with
an Out-of-Memory (OOM) error. Investigation shows that the root cause is the way
deltalog files are handled on the read path. Specifically:
# {*}Redundant Copy Operation{*}:
After deltalog entries are read into a {{SpillableMap}} (which is designed to
spill data to disk once a memory threshold is exceeded), a second,
unnecessary *in-memory copy* of the data is made. This defeats the purpose
of {{SpillableMap}}: all data ends up resident in memory instead of spilling
to disk as intended.
# {*}Key Code Path{*}:
https://github.com/apache/hudi/blob/8b7bd1391bd61600d23b8defed4cdc7d789502d1/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/Iterators.scala#L416
# {*}Impact{*}:
For large deltalog files or tables with frequent updates, this redundant copy
operation leads to excessive memory consumption, ultimately causing OOM errors
and query failures.
# {*}Expected Behavior{*}:
The {{SpillableMap}} should manage memory/disk spilling transparently, and no
unnecessary in-memory copies should occur during the deltalog processing phase.
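
To illustrate the problem pattern described above, here is a minimal Java sketch. It is not the code at the linked line and does not use Hudi's API: {{ToySpillableMap}}, {{eagerCopy}} and {{streamTotalValueLength}} are hypothetical names, and the "disk" is simulated with serialized byte arrays. It only shows why copying a spillable map into a regular map brings every entry back onto the heap, while iterating it lazily does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical toy, NOT Hudi's spillable map: keeps at most `maxInMemory`
// entries as live objects and "spills" the rest. Assumes unique keys.
class ToySpillableMap implements Iterable<Map.Entry<String, String>> {
    private final int maxInMemory;
    private final Map<String, String> inMemory = new HashMap<>();
    // Stand-in for the on-disk store: serialized "key=value" records.
    // A real implementation would write these to a file.
    private final List<byte[]> spilled = new ArrayList<>();

    ToySpillableMap(int maxInMemory) { this.maxInMemory = maxInMemory; }

    void put(String key, String value) {
        if (inMemory.size() < maxInMemory) {
            inMemory.put(key, value);
        } else {
            spilled.add((key + "=" + value).getBytes(StandardCharsets.UTF_8));
        }
    }

    int size() { return inMemory.size() + spilled.size(); }

    int inMemorySize() { return inMemory.size(); }

    // Lazy iterator: spilled records are deserialized one at a time.
    @Override
    public Iterator<Map.Entry<String, String>> iterator() {
        Iterator<Map.Entry<String, String>> mem = inMemory.entrySet().iterator();
        Iterator<byte[]> disk = spilled.iterator();
        return new Iterator<Map.Entry<String, String>>() {
            public boolean hasNext() { return mem.hasNext() || disk.hasNext(); }
            public Map.Entry<String, String> next() {
                if (mem.hasNext()) return mem.next();
                String[] kv = new String(disk.next(), StandardCharsets.UTF_8).split("=", 2);
                return new AbstractMap.SimpleEntry<>(kv[0], kv[1]);
            }
        };
    }
}

class SpillCopyDemo {
    // Anti-pattern: copies every entry, spilled ones included, into a heap map.
    static Map<String, String> eagerCopy(ToySpillableMap source) {
        Map<String, String> copy = new HashMap<>();
        for (Map.Entry<String, String> e : source) copy.put(e.getKey(), e.getValue());
        return copy;
    }

    // Alternative: consume entries one at a time; nothing is retained.
    static long streamTotalValueLength(ToySpillableMap source) {
        long total = 0;
        for (Map.Entry<String, String> e : source) total += e.getValue().length();
        return total;
    }

    public static void main(String[] args) {
        ToySpillableMap records = new ToySpillableMap(10);
        for (int i = 0; i < 1000; i++) records.put("key" + i, "value" + i);
        System.out.println("live entries held by the map: " + records.inMemorySize());
        System.out.println("live entries after eager copy: " + eagerCopy(records).size());
        System.out.println("streamed value chars: " + streamTotalValueLength(records));
    }
}
```

In this sketch the map itself never holds more than its in-memory budget as live entries, but the eager copy holds all of them at once. That is the behavior the expected fix should avoid, presumably by iterating the {{SpillableMap}} directly rather than materializing it.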