HippoBaro commented on PR #9697:
URL: https://github.com/apache/arrow-rs/pull/9697#issuecomment-4248638163

   @alamb @nathanb9 Thank you both for your feedback — it's been very helpful.
   
   The watermark mechanism is fundamentally flawed. I worked under the 
assumption that all Parquet files are laid out like mine are, but the spec 
makes no such guarantee: "There is no physical structure that is guaranteed for 
a row group." Column chunks within a row group need not be contiguous or 
ordered, so a monotonic watermark won't be correct in general. @nathanb9's 
reverse-scan regression is one concrete consequence, but it's not the only one. 
The deletion trick is actually quite neat.
   
   > I was thinking more about this change and I thought maybe we could take a 
step back and figure out what you are trying to accomplish
   
   Fair point. I should have been clearer in the original description. Here are 
the high-level goals driving this work:
   * **Linear scaling with column count.** My schemas are very high 
cardinality. Loads of columns. `PushBuffers` must scale linearly (ideally 
sub-linearly) with the number of columns. The current implementation scales 
with the number of requested ranges from the reader, which can reach millions 
on adversarial inputs.
   * **Total decoupling from the IO layer.** The decoder should own all logic 
related to *what* to read; the IO layer should be exclusively concerned with 
*how* best to exploit the characteristics of the underlying storage medium 
along whatever axis the user cares about (throughput, latency, cost). The 
current design leaks information across that boundary: buffers have to align 
with logical range requests or they leak, and can't straddle request boundaries.
   * **Coalescing should not be load bearing.**  `PushBuffers` should scale 
sub-linearly to the number of buffers it's given. Coalescing should be decided 
based on the characteristics of storage, not because the code can't deal with 
too many small buffers.
   * **First-class support for prefetching.** My data lives in blob storage. 
The economics of blob storage incentivize pulling as much as possible: you pay 
the toll per GET request. If we need two row groups separated by 100 MiB of 
pruned row groups, it is most likely cheapest to stream all of it and discard 
than to seek. There are also cases where the reader doesn't know which row 
groups to pull next, because that depends on the content of the current row 
group — i.e., there are data dependencies. So prefetching can't simply be "give 
a list of row groups of interest ahead of time."
   * **No leaks.** Whatever the IO layer does (coalescing, prefetch, etc.), 
regardless of access pattern or pruning ratio, we should never leak buffers. 
This doesn't mean every byte must be freed at the earliest opportunity, but at 
row-group granularity, everything should be released when we move on to the 
next. A single decoder should be able to stream-decode TBs worth of data, and 
the user should have RSS guarantees derived from the max row-group size of the 
data they are dealing with (and modulo how aggressive prefetching is allowed to 
be, which they can easily control).
   
   > if your usecase is to clear all IO after reading a row group's data, I 
wonder if we could you just call clear_all_buffers after reading the row 
group's data? 
   
   The problem is that by the time I finish decoding a row group, I only know 
I'm done with *that* row group. The IO layer may have already prefetched data 
I'm about to need. Calling `release_all` would throw away that prefetched work.
   
   The watermark was a nice idea for this: if we know that after reading byte 
offset N we'll never read anything below N, we can prefetch freely above N and 
safely release everything below. But since we can't guarantee monotonic access, 
that doesn't hold.
   
   I'll push a revised version that addresses all the feedback shortly!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to