mbutrovich commented on issue #20728: URL: https://github.com/apache/datafusion/issues/20728#issuecomment-4007043460
Thanks for the quick response @EmilyMatt! You're right that `try_grow` is called once per `sort_batch_stream` invocation, not per chunk. The issue is that `sort_batch_stream` is called [once per in-memory batch in the multi-batch merge path](https://github.com/apache/datafusion/blob/13cebf8432a3c649ebc3426933753e159b2b7b9a/datafusion/physical-plan/src/sorts/sort.rs#L687), and the old code used `drop(reservation)` — which for most `MemoryPool` implementations is very cheap (just decrementing a counter, no pool-level reacquisition needed). The new code does `reservation.free()` + `reservation.try_grow(total_sorted_size)`, which is two full pool roundtrips where there were previously zero. In Comet's case each pool call crosses JNI into Spark's `TaskMemoryManager.acquireExecutionMemory()`, which is expensive — but this would affect any `MemoryPool` implementation where `try_grow`/`shrink` do real work (coordination with an external memory manager, logging, etc.). We confirmed this by replacing `free()` + `try_grow()` with `try_resize()`, which only interacts with the pool for the delta (typically near-zero since sorted output ≈ input size). The `acquireMemory` samples dropped from 22,356 to 451 (back to DF51 levels), and wall-clock time improved. I should have full benchmarks shortly. In the meantime I have a draft PR up already: #20729 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
