alamb commented on PR #7997: URL: https://github.com/apache/arrow-rs/pull/7997#issuecomment-3423414476
> One question I have left is performance, but I think the right place to do that is a follow up PR where you replace the current internal machinery with this, then we can do a main vs. PR type benchmark. I agree -- I expect this code to have exactly the same performance as the current async decoder -- it is the same algorithm, just the state machine(s) are now explicit rather than implicit via `await` > follow up PR where you replace the current internal machinery with this Indeed. Here is that PR: - https://github.com/apache/arrow-rs/pull/8159 > I'm out of time right now, I will come back and take a look at parquet/src/arrow/push_decoder/reader_builder/mod.rs. > How does this fit into the current decoders? Does it replace them for all users? Is it something users have to opt into via code, or a feature flag? I'm sure the is obvious somewhere and I just haven't seen it. Currently this is a entirely new (3rd) decoder. My plan (and I have already done it in https://github.com/apache/arrow-rs/pull/8159) is to rewrite the existing async decoder to use this push decoder. > You mentioned this unlocks the possibility to widen IO by asking the push decoder to "look ahead" at what might need to be decoded next. I want to dig into what that would look like, I guess it would be something like asking RowGroupDecoderState for "what do you think is the next bit of data you will need"? In the StartData or WaitingOnData states that might be obvious / accurate but in the Start, Filters or WaitingOnFilterData that would be either a conservative or aggressive guess. Yes, exactly. Another thing that I would like to do is to make it possible to buffer only some of the pages needed for a row group rather than all of them (aka what is stored in InMemoryRowGroup). This would reduce memory requirements for files with large row groups. However, it would also increase the number of IO requests (aka object store requests) so it would have to be configurable to let people trade off the IOs and the memory requirements -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
