HippoBaro opened a new issue, #9695:
URL: https://github.com/apache/arrow-rs/issues/9695

     **Describe the bug**
   
   `PushBuffers::clear_ranges` performs an O(N*M) scan to release consumed 
buffers, where N is the number of buffered ranges and M is the number of ranges 
to clear. On wide schemas (10k+ columns), this produces quadratic overhead in 
`PushDecoder` row group construction.
   
     Additionally, `clear_ranges` matches buffers by exact range equality. When 
the IO layer coalesces adjacent requested ranges into fewer, larger fetches, 
the coalesced buffer never exactly matches any individual requested range, so 
`clear_ranges` silently skips it. The buffer leaks in `PushBuffers` until the 
decoder finishes or the caller manually calls `release_all_ranges`, increasing 
peak RSS proportionally to the amount of data coalesced ahead of the current 
row group.
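
A simplified model of the behavior described above, offered only as a sketch: `ModelBuffers` and its field are hypothetical stand-ins and are not the actual `PushBuffers` implementation in arrow-rs. It shows both the N*M comparison pattern and why a coalesced buffer is never matched.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the buffer store (NOT the real `PushBuffers`).
struct ModelBuffers {
    /// Byte ranges of the buffers currently held.
    ranges: Vec<Range<u64>>,
}

impl ModelBuffers {
    /// Release by exact range equality: for each of the N held buffers,
    /// linearly scan the M ranges to clear, i.e. O(N*M) comparisons.
    fn clear_ranges(&mut self, to_clear: &[Range<u64>]) {
        self.ranges.retain(|held| !to_clear.contains(held));
    }
}
```

In this model, a store holding `0..100` and `100..250` is emptied by clearing those two ranges, whereas a store holding the single coalesced buffer `0..250` is left untouched by the same call, because `0..250` equals neither requested range.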
   
   This puts coalescing in a bind: without it, buffer count scales with range 
count and the quadratic `clear_ranges` dominates. With it, memory is not 
reclaimed incrementally.
   
     **To Reproduce**
   
   Use `PushDecoder` on a Parquet file with a wide schema (10k+ columns). Push 
data without coalescing buffers and observe row group construction time scaling 
quadratically with column count. Conversely, with coalescing enabled, observe 
that `clear_ranges` does not release memory, so RSS grows until the decoder 
finishes.
   
     **Expected behavior**
   
   Buffer release should scale with buffer count (not range count), and 
coalesced or arbitrarily-sized buffers should be released incrementally as the 
decoder progresses.
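
One possible shape for such a release, as a sketch under stated assumptions rather than a proposed patch: `SortedBuffers` and `release_before` are hypothetical names, buffers are assumed to be non-overlapping and kept sorted by start offset, and the decoder is assumed to consume the file in ascending offset order so that a single "consumed up to" offset is meaningful.

```rust
use std::ops::Range;

/// Hypothetical store (NOT the arrow-rs API). Invariant assumed here:
/// `ranges` is sorted by start offset and the ranges do not overlap.
struct SortedBuffers {
    ranges: Vec<Range<u64>>,
}

impl SortedBuffers {
    /// Drop every buffer that ends at or before `watermark`, regardless of
    /// how the buffer boundaries relate to the originally requested ranges.
    /// Returns the number of buffers released.
    fn release_before(&mut self, watermark: u64) -> usize {
        // With sorted, non-overlapping ranges the end offsets are sorted
        // too, so the releasable buffers form a prefix found by binary search.
        let n = self.ranges.partition_point(|r| r.end <= watermark);
        self.ranges.drain(..n);
        n
    }
}
```

The cost depends only on the number of buffers held, not on the number of ranges being cleared. A buffer that straddles the watermark stays until the decoder has moved past its end, so a coalesced buffer is released one step later than its first constituent range but is no longer held until the decoder finishes.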
   
     **Additional context**

