alamb commented on PR #7997:
URL: https://github.com/apache/arrow-rs/pull/7997#issuecomment-3423671554

   > > Another thing I would like to do is make it possible to buffer only some of the pages needed for a row group rather than all of them (i.e. what is stored in `InMemoryRowGroup`). This would reduce memory requirements for files with large row groups. However, it would also increase the number of IO requests (object store requests), so it would have to be configurable to let people trade off IO count against memory.
   > 
   > IMO the parquet decoder should produce _as granular as possible_ ranges of data to read, and the object store implementation can handle coalescing them as needed. Currently some object stores (remote ones that talk to S3, GCS, etc.) already coalesce ranges to be more efficient, but to what degree is hidden behind hardcoded constants (https://github.com/apache/arrow-rs-object-store/blob/ad1d70f4876b0c2ea6c6a5e34dc158c63f861384/src/util.rs#L90-L95). Maybe that is what should be exposed to tweak the tradeoff between IO round-trips and memory? I guess we still need to decide _when_ to hit object storage (i.e. how many ranges or bytes we accumulate before making a request).
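   For illustration, the gap-based coalescing that the linked constants control could look roughly like this. This is a minimal sketch, not the actual object_store implementation: the `MAX_GAP` value and the `coalesce` helper are invented here, and the real code's constants and logic differ.

```rust
use std::ops::Range;

// Hypothetical gap threshold; the real constants live in
// arrow-rs-object-store's util.rs and are not configurable today.
const MAX_GAP: u64 = 1024 * 1024; // merge ranges separated by <= 1 MiB

/// Merge sorted, non-overlapping byte ranges whose gaps are at most
/// `max_gap`, trading extra bytes fetched for fewer requests.
fn coalesce(ranges: &[Range<u64>], max_gap: u64) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Close enough to the previous range: extend it instead of
            // issuing a separate request.
            Some(prev) if r.start.saturating_sub(prev.end) <= max_gap => {
                prev.end = prev.end.max(r.end);
            }
            _ => out.push(r.clone()),
        }
    }
    out
}

fn main() {
    // Two page ranges 100 bytes apart collapse into one request;
    // a distant range stays a separate request.
    let merged = coalesce(&[0..100, 200..300, 10_000_000..10_000_100], MAX_GAP);
    println!("{merged:?}"); // [0..300, 10000000..10000100]
}
```

   Making `max_gap` a caller-visible knob is one way the IO-count vs. memory tradeoff could become tunable rather than hardcoded.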
   
   I agree it would be good to issue granular requests for data, but I think that is orthogonal to the question of what data to wait for before decoding can start.
   
   Right now the readers (including the push decoder) wait until all the data for a RowGroup (after filtering) has been fetched.
   
   Once the decoder can tell the caller exactly how much data it needs to decode the next batch, I think we will be in a much better position to control the CPU vs. memory tradeoff.
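   A hypothetical shape for that decoder-driven flow is sketched below. None of these types exist in arrow-rs (`DecodeState`, `PushDecoderStub`, and `RecordBatchStub` are all invented for illustration); the point is only the control flow where the caller fetches exactly the ranges the decoder reports needing, rather than prefetching a whole row group.

```rust
use std::ops::Range;

// Hypothetical types: nothing here matches the real push decoder API in
// apache/arrow-rs; this only illustrates the proposed control flow.
#[allow(dead_code)]
struct RecordBatchStub; // stand-in for an Arrow RecordBatch

#[allow(dead_code)]
enum DecodeState {
    /// The decoder needs these byte ranges before it can make progress.
    NeedsData(Vec<Range<u64>>),
    /// A batch is ready; buffers for already-decoded pages can be freed.
    Ready(RecordBatchStub),
    Finished,
}

struct PushDecoderStub {
    // Pretend queue of "next ranges I need", most urgent last.
    pending: Vec<Vec<Range<u64>>>,
}

impl PushDecoderStub {
    fn poll(&mut self) -> DecodeState {
        match self.pending.pop() {
            Some(ranges) => DecodeState::NeedsData(ranges),
            None => DecodeState::Finished,
        }
    }

    fn push_data(&mut self, _range: Range<u64>, _bytes: &[u8]) {
        // A real decoder would buffer the bytes and decode when possible.
    }
}

fn main() {
    let mut decoder = PushDecoderStub {
        pending: vec![vec![0..4096], vec![4096..8192]],
    };
    let mut requests = 0;
    loop {
        match decoder.poll() {
            DecodeState::NeedsData(ranges) => {
                // The caller fetches exactly these ranges (possibly
                // coalesced by the object store layer) and feeds them back.
                for r in ranges {
                    requests += 1;
                    decoder.push_data(r, &[]);
                }
            }
            DecodeState::Ready(_) => { /* hand the batch to the consumer */ }
            DecodeState::Finished => break,
        }
    }
    println!("made {requests} range requests");
}
```

   With this shape, deciding how many reported ranges to accumulate before issuing a fetch becomes the caller's policy, which is where the IO vs. memory knob would live.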


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
