[
https://issues.apache.org/jira/browse/HUDI-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Moran updated HUDI-9199:
------------------------
Description:
When querying a Merge-On-Read (MOR) Hudi table, the Spark executor fails with
an Out-of-Memory (OOM) error. The root cause is in how deltalog files are
handled on the read path. Specifically:
# {*}Redundant Copy Operation{*}:
After reading deltalog entries into a {{SpillableMap}} (which is designed to
spill data to disk when memory thresholds are exceeded), an additional
unnecessary *in-memory copy* of the data is performed. This defeats the purpose
of {{SpillableMap}} and forces all data to reside in memory instead of spilling
to disk as intended.
# {*}Key Code Path{*}:
[https://github.com/apache/hudi/blob/8b7bd1391bd61600d23b8defed4cdc7d789502d1/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/Iterators.scala#L416]
# {*}Impact{*}:
For large deltalog files or tables with frequent updates, this redundant copy
operation leads to excessive memory consumption, ultimately causing OOM errors
and query failures.
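The effect described above can be illustrated with a minimal, self-contained sketch. This is not Hudi's code: {{ToySpillableMap}}, {{eagerCopy}} and {{streamBytes}} are hypothetical names invented for this illustration, and the "disk" is simulated by storing only record sizes. It only shows why copying a spillable map into a plain in-memory map pulls every spilled record back onto the heap, whereas iterating over it record by record does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these types are Hudi classes.
final class SpillCopySketch {

    /** Toy stand-in for a spillable map: at most maxInMemory values stay on the heap. */
    static final class ToySpillableMap {
        private final int maxInMemory;
        private final Map<String, byte[]> inMemory = new HashMap<>();
        // Simulated disk: only the record size is remembered for spilled entries.
        private final Map<String, Integer> spilledSizes = new HashMap<>();

        ToySpillableMap(int maxInMemory) {
            this.maxInMemory = maxInMemory;
        }

        void put(String key, byte[] value) {
            if (inMemory.containsKey(key) || inMemory.size() < maxInMemory) {
                inMemory.put(key, value);
            } else {
                spilledSizes.put(key, value.length); // "spill" instead of holding the bytes
            }
        }

        /** Returns the value, re-materializing it from the simulated disk if it was spilled. */
        byte[] get(String key) {
            byte[] value = inMemory.get(key);
            if (value != null) {
                return value;
            }
            Integer length = spilledSizes.get(key);
            return length == null ? null : new byte[length];
        }

        List<String> keys() {
            List<String> all = new ArrayList<>(inMemory.keySet());
            all.addAll(spilledSizes.keySet());
            return all;
        }

        int heapResidentEntries() {
            return inMemory.size();
        }
    }

    /** The problematic pattern: every record, spilled or not, ends up referenced on the heap. */
    static Map<String, byte[]> eagerCopy(ToySpillableMap source) {
        Map<String, byte[]> copy = new HashMap<>();
        for (String key : source.keys()) {
            copy.put(key, source.get(key));
        }
        return copy;
    }

    /** The alternative: visit one record at a time and let it become garbage afterwards. */
    static long streamBytes(ToySpillableMap source) {
        long total = 0;
        for (String key : source.keys()) {
            total += source.get(key).length;
        }
        return total;
    }

    public static void main(String[] args) {
        ToySpillableMap records = new ToySpillableMap(2);
        for (int i = 0; i < 5; i++) {
            records.put("key" + i, new byte[1024]);
        }
        System.out.println("heap-resident before copy: " + records.heapResidentEntries());
        System.out.println("heap-resident in copy:     " + eagerCopy(records).size());
        System.out.println("bytes seen when streaming: " + streamBytes(records));
    }
}
```

In this toy setup the map keeps 2 of 5 records on the heap, while the eager copy holds all 5 at once. With real deltalog data the same pattern scales with the total size of the log records rather than with the configured memory threshold.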
was:
When querying a Merge-On-Read (MOR) Hudi table, the Spark executor encounters
an Out-of-Memory (OOM) error. Upon investigation, it was observed that the root
cause lies in the handling of deltalog files during the read path. Specifically:
# {*}Redundant Copy Operation{*}:
After reading deltalog entries into a {{SpillableMap}} (which is designed to
spill data to disk when memory thresholds are exceeded), an additional
unnecessary *in-memory copy* of the data is performed. This defeats the purpose
of {{SpillableMap}} and forces all data to reside in memory instead of spilling
to disk as intended.
# {*}Key Code Path{*}:
https://github.com/apache/hudi/blob/8b7bd1391bd61600d23b8defed4cdc7d789502d1/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/Iterators.scala#L416
# {*}Impact{*}:
For large deltalog files or tables with frequent updates, this redundant copy
operation leads to excessive memory consumption, ultimately causing OOM errors
and query failures.
# {*}Expected Behavior{*}:
The {{SpillableMap}} should manage memory/disk spilling transparently, and no
unnecessary in-memory copies should occur during the deltalog processing phase.
> OOM when querying MOR table due to redundant copy of deltalog SpillableMap
> into memory
> --------------------------------------------------------------------------------------
>
> Key: HUDI-9199
> URL: https://issues.apache.org/jira/browse/HUDI-9199
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Moran
> Priority: Major