wirybeaver commented on issue #22758:
URL: https://github.com/apache/datafusion/issues/22758#issuecomment-4677072368

   I've been working on memory accounting improvements and filed #22898 
proposing a three-layer migration path:
   
   1. **Manual `try_grow` patches** — fix under-accounting by adding 
`MemoryReservation` at each untracked allocation site in spill-capable 
operators (NLJ, hash aggregation, external sort). Prototype: 
https://github.com/wirybeaver/datafusion/pull/1
   2. **Post-construction `claim()` via `ArrowMemoryPool`** — use Arrow's 
idempotent `buffer.claim(pool)` API at output boundaries. Eliminates both 
under-accounting (can't miss sites) and over-accounting (shared `Arc<Buffer>` 
counted once). Prototype: https://github.com/wirybeaver/datafusion/pull/3
   3. **Builder `with_pool()` for real-time enforcement** — upstream arrow-rs 
change adding `TrackedVec<T>` so builder growth goes through the pool, 
enforcing limits *before* allocation. Prototype: 
https://github.com/wirybeaver/arrow-rs/pull/1
   
   The key finding is that Arrow builders use `Vec<T>` internally (not 
`MutableBuffer`), so the existing `MutableBuffer` reservation field is bypassed 
during the build phase. Layer 2 (`claim()`) is shippable today on arrow-buffer 
58.3; Layer 3 needs upstream work.
   
   Would love feedback on the approach — details and tradeoff comparison in 
#22898.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to