andygrove opened a new issue, #2661: URL: https://github.com/apache/datafusion-comet/issues/2661
### Describe the bug Thanks to @EmilyMatt for originally reporting this issue in https://github.com/apache/datafusion-comet/pull/2639#issuecomment-3437940048. Here is an AI-written summary of the issue # The Core Problem ## Memory Model Mismatch When Arrow arrays are passed from Spark (JVM) to DataFusion (Rust): - Data buffers: Can be allocated off-heap (native memory) - Wrapper objects (ArrowArray, ArrowSchema): Always allocated on Java heap - Even though the actual data is off-heap, each ArrayRef needs these small heap-allocated wrappers ## GC Pressure from Buffering Operators The issue arises with operators like SortExec that consume their entire input: ```rust futures::stream::once(async move { while let Some(batch) = input.next().await { sorter.insert_batch(batch).await?; // Accumulates all batches } sorter.sort().await // Only then produces output }) ``` This pattern: 1. Consumes the entire input iterator before producing output 2. Creates many wrapper objects (one set per batch) 3. Keeps all wrappers alive until sorting completes 4. Causes severe GC pressure as thousands of small objects accumulate on the heap ## Why It's Worse with Off-Heap Memory When using off-heap memory for performance, users typically: - Reduce executor heap size (since data is off-heap) - This makes the heap smaller → GC pressure from wrapper objects becomes catastrophic - Can cause 10x performance degradation on clusters with large data ## Why Deep Copy Solves It (The Paradox) The comment suggests doing a deep copy of ArrayData before make_array in the scan. This seems counterintuitive but works because: 1. Immediate Materialization: Deep copy fully materializes data into new arrays 2. Immediate GC: Original wrapper objects can be garbage collected right away 3. Clean Boundaries: Each batch owns its data completely 4. Less GC Thrashing: Even though copying costs CPU, it's cheaper than continuous GC pauses Without copy: Many small wrapper objects → constant GC pressure With copy: Upfront copy cost → clean memory lifecycle → smooth execution ### Steps to reproduce _No response_ ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
