alamb commented on PR #7997:
URL: https://github.com/apache/arrow-rs/pull/7997#issuecomment-3423414476

   > One question I have left is performance, but I think the right place to do 
that is a follow up PR where you replace the current internal machinery with 
this, then we can do a main vs. PR type benchmark.
   
   I agree -- I expect this code to have exactly the same performance as the current async decoder: it is the same algorithm, just with the state machine(s) made explicit rather than implicit via `await`.
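   
   To illustrate what I mean by "explicit" (a rough sketch with made-up names, not the API in this PR): with `async` the compiler hides the state machine behind each `.await`, whereas a push decoder surfaces the state as a value the caller can match on and drive:
   
   ```rust
   use std::ops::Range;
   
   /// Hypothetical states -- not the actual types in this PR -- showing the idea:
   /// the state machine is a plain enum the caller can inspect, instead of being
   /// generated by the compiler behind each `.await`.
   enum DecodeState {
       /// The decoder needs these byte ranges before it can make progress.
       NeedsData(Vec<Range<u64>>),
       /// The decoder produced some decoded output.
       Ready(Vec<u8>),
       /// Nothing left to decode.
       Finished,
   }
   
   /// Hypothetical push-style decoder: the caller drives the IO and pushes bytes in.
   struct PushDecoder {
       pending: Vec<Range<u64>>,
       done: bool,
   }
   
   impl PushDecoder {
       fn new() -> Self {
           Self { pending: vec![0..1024], done: false }
       }
   
       /// Advance the explicit state machine using whatever data has been pushed so far.
       fn next_state(&mut self) -> DecodeState {
           if self.done {
               DecodeState::Finished
           } else if !self.pending.is_empty() {
               DecodeState::NeedsData(std::mem::take(&mut self.pending))
           } else {
               self.done = true;
               DecodeState::Ready(Vec::new())
           }
       }
   
       /// The caller hands the requested bytes back to the decoder.
       fn push_data(&mut self, _range: Range<u64>, _bytes: Vec<u8>) {
           // A real decoder would buffer these bytes for the next `next_state` call.
       }
   }
   
   fn main() {
       let mut decoder = PushDecoder::new();
       loop {
           match decoder.next_state() {
               DecodeState::NeedsData(ranges) => {
                   // The caller performs the IO however it likes: sync, async, cached, ...
                   for range in ranges {
                       let len = (range.end - range.start) as usize;
                       decoder.push_data(range, vec![0u8; len]);
                   }
               }
               DecodeState::Ready(batch) => println!("decoded {} bytes", batch.len()),
               DecodeState::Finished => break,
           }
       }
   }
   ```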
   
   >  follow up PR where you replace the current internal machinery with this
   
   Indeed. Here is that PR:
   - https://github.com/apache/arrow-rs/pull/8159
   
   
   > I'm out of time right now, I will come back and take a look at 
parquet/src/arrow/push_decoder/reader_builder/mod.rs.
   
   > How does this fit into the current decoders? Does it replace them for all users? Is it something users have to opt into via code, or a feature flag? I'm sure this is obvious somewhere and I just haven't seen it.
   
   Currently this is an entirely new (third) decoder. My plan, which I have already carried out in https://github.com/apache/arrow-rs/pull/8159, is to rewrite the existing async decoder to use this push decoder.
   
   > You mentioned this unlocks the possibility to widen IO by asking the push 
decoder to "look ahead" at what might need to be decoded next. I want to dig 
into what that would look like, I guess it would be something like asking 
RowGroupDecoderState for "what do you think is the next bit of data you will 
need"? In the StartData or WaitingOnData states that might be obvious / 
accurate but in the Start, Filters or WaitingOnFilterData that would be either 
a conservative or aggressive guess.
   
   Yes, exactly.
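   
   To sketch what that could look like (hypothetical names only; nothing in this PR is committed to this shape), the push decoder would expose, alongside "what do you need right now", a "what do you think you will need next" hook whose answer carries how firm the prediction is:
   
   ```rust
   use std::ops::Range;
   
   /// Hypothetical: how firm the decoder's prediction is.
   enum Prediction {
       /// e.g. in WaitingOnData-like states: the exact ranges are already known.
       Exact(Vec<Range<u64>>),
       /// e.g. before filters have run: a conservative or aggressive guess.
       Speculative(Vec<Range<u64>>),
   }
   
   /// Hypothetical look-ahead hook the caller could use to start IO early.
   trait LookAhead {
       fn next_likely_ranges(&self) -> Prediction;
   }
   
   /// Toy implementation so the example compiles and runs.
   struct ToyState;
   
   impl LookAhead for ToyState {
       fn next_likely_ranges(&self) -> Prediction {
           Prediction::Speculative(vec![0..4096])
       }
   }
   
   fn main() {
       let state = ToyState;
       match state.next_likely_ranges() {
           Prediction::Exact(ranges) => println!("prefetch exactly {ranges:?}"),
           Prediction::Speculative(ranges) => println!("maybe prefetch {ranges:?}"),
       }
   }
   ```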
   
   Another thing I would like to do is make it possible to buffer only some of the pages needed for a row group rather than all of them (i.e., everything currently stored in InMemoryRowGroup). This would reduce memory requirements for files with large row groups. However, it would also increase the number of IO requests (object store requests), so it would have to be configurable so people can trade off IO count against memory usage.
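   
   For example (again, a hypothetical knob, not something that exists today), the configuration could be as simple as a cap on how much is buffered per row group, with `None` meaning today's "buffer the whole row group" behavior:
   
   ```rust
   /// Hypothetical configuration (not an existing arrow-rs option) for the trade-off
   /// described above: buffer fewer pages per row group at the cost of more IO requests.
   struct PageBufferingConfig {
       /// Upper bound on bytes buffered for a single row group at a time.
       /// `None` means buffer everything, i.e. the current InMemoryRowGroup behavior.
       max_buffered_bytes_per_row_group: Option<usize>,
   }
   
   fn main() {
       // Favor low memory: fetch pages in ~8 MiB chunks and accept more object store requests.
       let low_memory = PageBufferingConfig {
           max_buffered_bytes_per_row_group: Some(8 * 1024 * 1024),
       };
       // Favor fewer IO requests: buffer the whole row group up front.
       let low_io = PageBufferingConfig {
           max_buffered_bytes_per_row_group: None,
       };
       println!(
           "low_memory = {:?}, low_io = {:?}",
           low_memory.max_buffered_bytes_per_row_group,
           low_io.max_buffered_bytes_per_row_group
       );
   }
   ```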
   
   
   

