alamb commented on PR #9697: URL: https://github.com/apache/arrow-rs/pull/9697#issuecomment-4281589075
> This gives us fairly clean decoupling: the IO layer can do whatever it wants, but can't push unsolicited buffers, which seems like a reasonable constraint.
>
> WDYT @alamb?

I think pushing unsolicited buffers is important to support a prefetching use case, though it is not clear whether your proposal precludes that.

I also worry it will be complicated to track the exact ranges needed / not needed, and it adds a new non-trivial constraint on the decoder to do range tracking.

I really liked the theoretical simplicity of your initial watermark, and I feel like we should be able to leverage the fact that the biggest unit of buffering is a row group. As soon as the decoder is done with a row group, any data prefetched for it can be released.

Maybe we could add some way for the decoder to report, at a higher granularity, what might still be requested. For example, maybe we could add an API like `PushDecoder::remaining_row_groups` that returns the row groups the decoder may still read in the future. The I/O subsystem can then handle mapping whatever prefetching it has done to those row groups, and when a row group is no longer needed it can cancel I/O and/or flush or whatever 🤔

And with the ability to clear the buffers in the push decoder, you can control memory at a very fine grain.
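
To make the row-group idea concrete, here is a rough sketch of how an I/O layer might use it. Everything in it is hypothetical: `MockDecoder`, `PrefetchCache`, `finish_row_group`, and `release_finished` are illustrative names, and `remaining_row_groups` is only the API suggested above, not something that exists in arrow-rs today.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the push decoder: it only tracks which row
/// groups it may still read. Not an existing arrow-rs type.
struct MockDecoder {
    remaining: BTreeSet<usize>,
}

impl MockDecoder {
    fn new(row_groups: impl IntoIterator<Item = usize>) -> Self {
        Self { remaining: row_groups.into_iter().collect() }
    }

    /// The proposed API: row groups the decoder may still request data for.
    fn remaining_row_groups(&self) -> Vec<usize> {
        self.remaining.iter().copied().collect()
    }

    /// Mark a row group as fully decoded (assumed to happen inside the decoder).
    fn finish_row_group(&mut self, idx: usize) {
        self.remaining.remove(&idx);
    }
}

/// Hypothetical I/O-side cache: prefetched bytes keyed by row group index.
#[derive(Default)]
struct PrefetchCache {
    buffers: BTreeMap<usize, Vec<u8>>,
}

impl PrefetchCache {
    /// Store unsolicited (prefetched) data for a row group.
    fn prefetch(&mut self, row_group: usize, data: Vec<u8>) {
        self.buffers.insert(row_group, data);
    }

    /// Drop buffers for row groups the decoder no longer needs and
    /// return the number of bytes released.
    fn release_finished(&mut self, remaining: &[usize]) -> usize {
        let before = self.buffered_bytes();
        self.buffers.retain(|rg, _| remaining.contains(rg));
        before - self.buffered_bytes()
    }

    fn buffered_bytes(&self) -> usize {
        self.buffers.values().map(Vec::len).sum()
    }
}

fn main() {
    let mut decoder = MockDecoder::new([0, 1, 2]);
    let mut cache = PrefetchCache::default();
    cache.prefetch(0, vec![0u8; 100]);
    cache.prefetch(1, vec![0u8; 200]);
    cache.prefetch(2, vec![0u8; 300]);

    // Once row group 0 is decoded, its prefetched data can be released.
    decoder.finish_row_group(0);
    let released = cache.release_finished(&decoder.remaining_row_groups());
    println!("released {released} bytes, {} still buffered", cache.buffered_bytes());
}
```

The point of the sketch is that the decoder does no byte-range tracking at all: it only reports row group indices, and the mapping from prefetched I/O to row groups stays entirely on the I/O side.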
