mbutrovich opened a new issue, #20728:
URL: https://github.com/apache/datafusion/issues/20728

   ### Describe the bug
   Commit 3f0b3425c ("Improve sort memory resilience", #19494) added per-chunk 
`reservation.try_grow(total_sorted_size)` calls inside the sort's `unfold` 
closure (`sort.rs:748-750`). This causes a significant performance regression 
for projects that implement custom `MemoryPool` backends where `try_grow` is 
not trivially cheap.
   
   In Apache DataFusion Comet, `try_grow` crosses JNI into Spark's 
`TaskMemoryManager.acquireExecutionMemory()`, which internally calls 
`ExecutionMemoryPool.memoryUsed()` — an O(n) HashMap summation. The result is a 
**32x increase** in `acquireMemory` calls compared to DataFusion
   51, where the sort path did not have per-chunk reservations.
   
   ### To Reproduce
   
   We ran TPC-H SF100 query 21 using DataFusion Comet (Spark accelerator using 
DataFusion). Query 21 is SMJ-heavy with multiple sort operators feeding into 
sort-merge joins.
   
   - **DF 51**: 695 `acquireMemory` JFR samples
   - **DF 52.2**: 22,356 `acquireMemory` JFR samples (~32x increase)
   
   JFR profiling shows the hot path:
   
   CometFairMemoryPool::try_grow
     → JNI → CometTaskMemoryManager.acquireMemory
       → TaskMemoryManager.acquireExecutionMemory
         → ExecutionMemoryPool.memoryUsed()  // O(n) HashMap sum
   
   The call originates from `sort.rs` line 748-750:
   
   ```rust
   reservation
       .try_grow(total_sorted_size)
       .map_err(Self::err_with_oom_context)?;
   
   This runs inside the unfold closure for every chunk of sorted output.
   
   Expected behavior
   
   Memory accounting should not assume try_grow is free. The pre-#19494 sort 
code managed memory without per-chunk try_grow calls. The new granular tracking 
is valuable for resilience but the cost should be amortized — *e.g.*, reserving 
once for the full sorted output rather than
   per-chunk, or providing a way to batch reservations.
   
   Additional context
   
   - Workload: TPC-H SF100 Q21, 2 iterations, JFR captured on executor
   - Comet memory pool delegates to Spark via JNI, so each try_grow has 
non-trivial overhead
   - This likely affects any MemoryPool implementation that does real work in 
try_grow (*e.g.*, coordination with external memory managers)
   
   Claude helped me write this issue after investigating. I will open a 
counterpart issue on Comet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to