hudi-bot opened a new issue, #16919:
URL: https://github.com/apache/hudi/issues/16919
When querying a Merge-On-Read (MOR) Hudi table, the Spark executor
fails with an Out-of-Memory (OOM) error. Investigation shows that the
root cause is the handling of delta log files on the read path.
Specifically:
1. **Redundant Copy Operation**:
   After reading delta log entries into a `SpillableMap` (which is designed to
   spill data to disk when memory thresholds are exceeded), an additional,
   unnecessary in-memory copy of the data is made. This defeats the purpose
   of `SpillableMap` and forces all data to reside in memory instead of spilling
   to disk as intended.
2. **Key Code Path**:
   https://github.com/apache/hudi/blob/8b7bd1391bd61600d23b8defed4cdc7d789502d1/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/Iterators.scala#L416
3. **Impact**:
   For large delta log files or tables with frequent updates, this redundant
   copy leads to excessive memory consumption, ultimately causing OOM
   errors and query failures.
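
The pattern described above can be illustrated with a small, self-contained sketch. This is not Hudi code: `ToySpillableMap`, `eagerCopy`, and `streamTotalLength` are hypothetical names invented for illustration, and the toy map budgets memory by entry count rather than by bytes. It only shows why copying a spill-capable map into a plain `HashMap` pulls every spilled value back onto the heap, while iterating over it holds one value at a time.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy stand-in for a spillable map; NOT Hudi's implementation.
// Assumption: each key is inserted at most once.
class ToySpillableMap {
    private final int maxInMemoryEntries;
    private final Map<String, String> inMemory = new LinkedHashMap<>();
    // key -> {offset, length} of the value inside the spill file
    private final Map<String, long[]> diskIndex = new LinkedHashMap<>();
    private final RandomAccessFile spillFile;

    ToySpillableMap(int maxInMemoryEntries) {
        this.maxInMemoryEntries = maxInMemoryEntries;
        try {
            File f = File.createTempFile("toy-spill", ".bin");
            f.deleteOnExit();
            this.spillFile = new RandomAccessFile(f, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void put(String key, String value) {
        if (inMemory.size() < maxInMemoryEntries) {
            inMemory.put(key, value); // still within the in-memory budget
            return;
        }
        try {
            // Budget exhausted: append the value to the spill file instead.
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            long offset = spillFile.length();
            spillFile.seek(offset);
            spillFile.write(bytes);
            diskIndex.put(key, new long[] {offset, bytes.length});
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String get(String key) {
        String value = inMemory.get(key);
        if (value != null) {
            return value;
        }
        long[] location = diskIndex.get(key);
        if (location == null) {
            return null;
        }
        try {
            // Read the spilled value back from disk on demand.
            byte[] bytes = new byte[(int) location[1]];
            spillFile.seek(location[0]);
            spillFile.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<String> keys() {
        List<String> all = new ArrayList<>(inMemory.keySet());
        all.addAll(diskIndex.keySet());
        return all;
    }

    int inMemoryCount() {
        return inMemory.size();
    }
}

class SpillCopyDemo {
    // Anti-pattern: every value, including spilled ones, ends up on the heap.
    static Map<String, String> eagerCopy(ToySpillableMap source) {
        Map<String, String> copy = new HashMap<>();
        for (String key : source.keys()) {
            copy.put(key, source.get(key));
        }
        return copy;
    }

    // Alternative: visit values one at a time; nothing is retained.
    static long streamTotalLength(ToySpillableMap source) {
        long total = 0;
        for (String key : source.keys()) {
            total += source.get(key).length();
        }
        return total;
    }

    public static void main(String[] args) {
        ToySpillableMap records = new ToySpillableMap(2);
        for (int i = 0; i < 5; i++) {
            records.put("k" + i, "value-" + i);
        }
        System.out.println("held in memory by the map: " + records.inMemoryCount());
        System.out.println("held in memory by the copy: " + eagerCopy(records).size());
        System.out.println("streamed length: " + streamTotalLength(records));
    }
}
```

In this sketch the map itself never holds more than its configured number of entries on the heap, but the result of `eagerCopy` holds all of them, so the spill budget no longer bounds memory use. Whether the fix in Hudi should iterate over the map directly or take some other form is left to the actual patch.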
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-9199
- Type: Bug