andygrove opened a new issue, #2661:
URL: https://github.com/apache/datafusion-comet/issues/2661

   ### Describe the bug
   
   Thanks to @EmilyMatt for originally reporting this issue in 
https://github.com/apache/datafusion-comet/pull/2639#issuecomment-3437940048.
   
   Here is an AI-written summary of the issue:
   
   # The Core Problem
   
   ## Memory Model Mismatch
   
   When Arrow arrays are passed from Spark (JVM) to DataFusion (Rust):
   - Data buffers: Can be allocated off-heap (native memory)
   - Wrapper objects (ArrowArray, ArrowSchema): Always allocated on Java heap
   - Even though the actual data is off-heap, each ArrayRef needs these small 
heap-allocated wrappers
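
   The JVM side of this handoff is hard to show in Rust, but the ownership rule it creates can be: below is a minimal, pure-Rust sketch of an Arrow C Data Interface round trip using `arrow::ffi::to_ffi`/`from_ffi` (in Comet the exporter is the JVM, not Rust). The two `FFI_*` wrapper structs are exactly the kind of small objects that must stay alive for as long as the imported array borrows the exporter's buffers:

   ```rust
   use arrow::array::{make_array, Array, ArrayRef, Int32Array};
   use arrow::ffi::{from_ffi, to_ffi};

   // Export an array as the two small C Data Interface wrapper structs, then
   // import it again. The imported ArrayData borrows the exporter's buffers
   // rather than copying them.
   fn roundtrip(values: Vec<i32>) -> ArrayRef {
       let original = Int32Array::from(values);
       // Exporter side (the JVM in Comet's case): produce the wrapper structs.
       let (ffi_array, ffi_schema) = to_ffi(&original.to_data()).unwrap();
       // Importer side (DataFusion): safe only because the exporter upholds the
       // C Data Interface contract and keeps the buffers alive via the release
       // callback carried inside the wrappers.
       make_array(unsafe { from_ffi(ffi_array, &ffi_schema) }.unwrap())
   }

   fn main() {
       assert_eq!(roundtrip(vec![1, 2, 3]).len(), 3);
   }
   ```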
   
   ## GC Pressure from Buffering Operators
   
   The issue arises with operators like SortExec that consume their entire 
input:
   
   ```rust
     futures::stream::once(async move {
         // Drain the entire input stream before emitting anything
         while let Some(batch) = input.next().await {
             sorter.insert_batch(batch?).await?;  // Accumulates all batches
         }
         sorter.sort().await  // Only then produces output
     })
   ```
   
   This pattern:
   1. Consumes the entire input iterator before producing output
   2. Creates many wrapper objects (one set per batch)
   3. Keeps all wrappers alive until sorting completes
   4. Causes severe GC pressure as thousands of small objects accumulate on the 
heap
   
   ## Why It's Worse with Off-Heap Memory
   
   When using off-heap memory for performance, users typically:
   - Reduce executor heap size (since the data lives off-heap)
   - The smaller heap amplifies GC pressure from the wrapper objects, which can become catastrophic
   - On clusters with large data, this can cause a 10x performance degradation
   
   ## Why Deep Copy Solves It (The Paradox)
   
   The comment suggests doing a deep copy of ArrayData before make_array in the 
scan. This seems counterintuitive but works because:
   
   1. Immediate Materialization: Deep copy fully materializes data into new 
arrays
   2. Immediate GC: Original wrapper objects can be garbage collected right away
   3. Clean Boundaries: Each batch owns its data completely
   4. Less GC Thrashing: Even though copying costs CPU, it's cheaper than 
continuous GC pauses
   
   Without copy: Many small wrapper objects → constant GC pressure
   With copy: Upfront copy cost → clean memory lifecycle → smooth execution
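
   A sketch of what such a copy could look like with arrow-rs (an illustrative helper, not Comet's actual implementation): `MutableArrayData` copies the underlying buffers into fresh, Rust-owned memory, so the returned array no longer references the original allocation:

   ```rust
   use arrow::array::{make_array, Array, ArrayData, ArrayRef, Int32Array, MutableArrayData};

   // Hypothetical helper: rebuild an array from freshly copied buffers so the
   // original allocation (and its heap-side wrappers) can be released at once.
   fn deep_copy(data: &ArrayData) -> ArrayRef {
       let mut mutable = MutableArrayData::new(vec![data], false, data.len());
       mutable.extend(0, 0, data.len()); // copies the buffers, not just pointers
       make_array(mutable.freeze())
   }

   fn main() {
       let source = Int32Array::from(vec![10, 20, 30]);
       let copied = deep_copy(&source.to_data());
       drop(source); // the original array can go away immediately
       assert_eq!(copied.len(), 3);
   }
   ```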
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

