mbutrovich opened a new issue, #20728:
URL: https://github.com/apache/datafusion/issues/20728
### Describe the bug
Commit 3f0b3425c ("Improve sort memory resilience", #19494) added per-chunk
`reservation.try_grow(total_sorted_size)` calls inside the sort's `unfold`
closure (`sort.rs:748-750`). This causes a significant performance regression
for projects that implement custom `MemoryPool` backends where `try_grow` is
not trivially cheap.
In Apache DataFusion Comet, `try_grow` crosses JNI into Spark's
`TaskMemoryManager.acquireExecutionMemory()`, which internally calls
`ExecutionMemoryPool.memoryUsed()` — an O(n) HashMap summation. The result is a
**32x increase** in `acquireMemory` calls compared to DataFusion
51, where the sort path did not have per-chunk reservations.
### To Reproduce
We ran TPC-H SF100 query 21 using DataFusion Comet (Spark accelerator using
DataFusion). Query 21 is SMJ-heavy with multiple sort operators feeding into
sort-merge joins.
- **DF 51**: 695 `acquireMemory` JFR samples
- **DF 52.2**: 22,356 `acquireMemory` JFR samples (~32x increase)
JFR profiling shows the hot path:
CometFairMemoryPool::try_grow
→ JNI → CometTaskMemoryManager.acquireMemory
→ TaskMemoryManager.acquireExecutionMemory
→ ExecutionMemoryPool.memoryUsed() // O(n) HashMap sum
The call originates from `sort.rs` line 748-750:
```rust
reservation
.try_grow(total_sorted_size)
.map_err(Self::err_with_oom_context)?;
This runs inside the unfold closure for every chunk of sorted output.
Expected behavior
Memory accounting should not assume try_grow is free. The pre-#19494 sort
code managed memory without per-chunk try_grow calls. The new granular tracking
is valuable for resilience but the cost should be amortized — *e.g.*, reserving
once for the full sorted output rather than
per-chunk, or providing a way to batch reservations.
Additional context
- Workload: TPC-H SF100 Q21, 2 iterations, JFR captured on executor
- Comet memory pool delegates to Spark via JNI, so each try_grow has
non-trivial overhead
- This likely affects any MemoryPool implementation that does real work in
try_grow (*e.g.*, coordination with external memory managers)
Claude helped me write this issue after investigating. I will open a
counterpart issue on Comet.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]