HippoBaro opened a new issue, #9695:
URL: https://github.com/apache/arrow-rs/issues/9695
**Describe the bug**
`PushBuffers::clear_ranges` performs an O(N*M) scan to release consumed
buffers, where N is the number of buffered ranges and M is the number of ranges
to clear. On wide schemas (10k+ columns), this produces quadratic overhead in
`PushDecoder` row group construction.
Additionally, `clear_ranges` matches buffers by exact range equality. When
the IO layer coalesces adjacent requested ranges into fewer, larger fetches,
the coalesced buffer never exactly matches any individual requested range, so
`clear_ranges` silently skips it. The buffer leaks in `PushBuffers` until the
decoder finishes or the caller manually calls `release_all_ranges`, increasing
peak RSS proportionally to the amount of data coalesced ahead of the current
row group.
This puts coalescing in a bind: without it, buffer count scales with range
count and the quadratic `clear_ranges` dominates. With it, memory is not
reclaimed incrementally.
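
A minimal sketch of both problems, for illustration only. `ToyBuffers` and `clear_ranges_exact` are hypothetical stand-ins, not the actual `PushBuffers` code; they only model the behavior described above (a full scan per cleared range, matched by exact equality).

```rust
use std::ops::Range;

/// Hypothetical stand-in for the buffer store (not the arrow-rs type).
/// Only the byte ranges are tracked; buffer payloads are omitted.
pub struct ToyBuffers {
    pub ranges: Vec<Range<u64>>,
}

impl ToyBuffers {
    /// Exact-match release: every range to clear triggers a full scan of the
    /// buffered ranges, so the cost is O(N * M).
    pub fn clear_ranges_exact(&mut self, to_clear: &[Range<u64>]) {
        for r in to_clear {
            // Keeps every buffer whose range is not exactly equal to `r`.
            self.ranges.retain(|b| b != r);
        }
    }
}

/// Clears two adjacent requested ranges and returns how many buffers remain.
pub fn remaining_after_clear(buffered: Vec<Range<u64>>) -> usize {
    let mut buffers = ToyBuffers { ranges: buffered };
    buffers.clear_ranges_exact(&[0..100, 100..200]);
    buffers.ranges.len()
}
```

With one buffer per requested range (`vec![0..100, 100..200]`), both buffers are released. With a single coalesced buffer (`vec![0..200]`), neither requested range equals the buffer's range, so the buffer stays.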
**To Reproduce**
Use `PushDecoder` on a Parquet file with a wide schema (10k+ columns). Push
data without coalescing buffers and observe row group construction time scaling
quadratically with column count. Conversely, when coalescing is used, observe
that `clear_ranges` does not release memory and RSS grows until the decoder
finishes.
**Expected behavior**
Buffer release should scale with buffer count (not range count), and
coalesced or arbitrarily-sized buffers should be released incrementally as the
decoder progresses.
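
One possible shape for this, as a sketch under stated assumptions rather than a proposed patch: release by a consumed-offset watermark instead of by exact range match. `WatermarkBuffers` and `release_through` are hypothetical names, and the sketch assumes the decoder consumes the file in increasing offset order and that buffers do not overlap.

```rust
use std::ops::Range;

/// Hypothetical buffer store (not the arrow-rs type). Only byte ranges are
/// tracked; buffer payloads are omitted.
pub struct WatermarkBuffers {
    pub ranges: Vec<Range<u64>>,
}

impl WatermarkBuffers {
    pub fn new() -> Self {
        Self { ranges: Vec::new() }
    }

    /// Registers a buffer of any size, coalesced or not.
    pub fn push(&mut self, range: Range<u64>) {
        self.ranges.push(range);
    }

    /// Releases every buffer that ends at or before `watermark` and returns
    /// how many were released. This is a single pass over the buffers, so the
    /// cost depends on the buffer count, not on how many ranges were requested.
    pub fn release_through(&mut self, watermark: u64) -> usize {
        let before = self.ranges.len();
        // Assumption: everything below `watermark` has been fully consumed.
        self.ranges.retain(|b| b.end > watermark);
        before - self.ranges.len()
    }
}
```

Under these assumptions, a coalesced buffer covering `0..200` is kept while the watermark is at 100 and released once the watermark reaches 200, regardless of how the original requests were split.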
**Additional context**