hudi-bot opened a new issue, #16919:
URL: https://github.com/apache/hudi/issues/16919

   When querying a Merge-On-Read (MOR) Hudi table, the Spark executor
fails with an Out-of-Memory (OOM) error. Investigation shows that the root
cause lies in how delta log files are handled on the read path. Specifically:

   1. **Redundant Copy Operation**:
      After delta log entries are read into a `SpillableMap` (which is designed
      to spill data to disk when memory thresholds are exceeded), an additional,
      unnecessary *in-memory copy* of the data is made. This defeats the purpose
      of `SpillableMap` and forces all data to reside in memory instead of
      spilling to disk as intended.
   2. **Key Code Path**:
      https://github.com/apache/hudi/blob/8b7bd1391bd61600d23b8defed4cdc7d789502d1/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/Iterators.scala#L416
   3. **Impact**:
      For large delta log files or tables with frequent updates, this redundant
      copy leads to excessive memory consumption, ultimately causing OOM
      errors and query failures.
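
To make the failure mode concrete, here is a minimal, self-contained Java sketch. It does not reproduce the Hudi code at the linked line. `ToySpillableMap`, `eagerCopy`, and `heapEntries` are hypothetical names invented for this illustration, and the "spilled" store is just a second map standing in for disk. It only shows why materializing a full copy of a spillable map removes the bound on heap usage.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class SpillCopyDemo {
    /** Hypothetical stand-in for a spillable map: at most maxInMemory entries stay "on heap". */
    static final class ToySpillableMap {
        private final int maxInMemory;
        private final Map<String, String> inMemory = new HashMap<>();
        // Stands in for an on-disk store; a real spillable map would serialize entries to a file.
        private final Map<String, String> spilled = new HashMap<>();

        ToySpillableMap(int maxInMemory) {
            this.maxInMemory = maxInMemory;
        }

        void put(String key, String value) {
            boolean fitsOnHeap = inMemory.size() < maxInMemory && !spilled.containsKey(key);
            if (inMemory.containsKey(key) || fitsOnHeap) {
                inMemory.put(key, value);
            } else {
                spilled.put(key, value); // over budget: entry goes to the "disk" side
            }
        }

        String get(String key) {
            String value = inMemory.get(key);
            return value != null ? value : spilled.get(key);
        }

        int heapEntries() { return inMemory.size(); }

        int size() { return inMemory.size() + spilled.size(); }

        /** Iterates the heap-resident entries first, then the spilled ones. */
        Iterator<Map.Entry<String, String>> iterator() {
            final Iterator<Map.Entry<String, String>> heap = inMemory.entrySet().iterator();
            final Iterator<Map.Entry<String, String>> disk = spilled.entrySet().iterator();
            return new Iterator<Map.Entry<String, String>>() {
                @Override public boolean hasNext() { return heap.hasNext() || disk.hasNext(); }
                @Override public Map.Entry<String, String> next() {
                    return heap.hasNext() ? heap.next() : disk.next();
                }
            };
        }
    }

    /** Anti-pattern: copies every entry, spilled or not, into an ordinary heap map. */
    static Map<String, String> eagerCopy(ToySpillableMap source) {
        Map<String, String> copy = new HashMap<>();
        Iterator<Map.Entry<String, String>> it = source.iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            copy.put(entry.getKey(), entry.getValue());
        }
        return copy;
    }

    public static void main(String[] args) {
        ToySpillableMap records = new ToySpillableMap(100);
        for (int i = 0; i < 1000; i++) {
            records.put("key-" + i, "value-" + i);
        }
        // The spillable map keeps its heap footprint bounded by the budget.
        System.out.println("heap entries in spillable map: " + records.heapEntries());
        // The eager copy holds every record on the heap, regardless of the budget.
        System.out.println("heap entries in eager copy:    " + eagerCopy(records).size());
        // Looking records up through the map itself avoids the second copy.
        System.out.println("lookup without copying:        " + records.get("key-999"));
    }
}
```

In this toy model the spillable map never holds more than its budget of 100 entries on the heap, while the copy holds all 1000. Reading through the map's own `get` or iterator keeps the memory bound intact.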
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-9199
   - Type: Bug


