Re: [I] [Python][Dataset][Parquet] Enable Pre-Buffering by default for Parquet s3 datasets [arrow]

via GitHub Thu, 05 Oct 2023 06:05:24 -0700


lidavidm commented on issue #36765:
URL: https://github.com/apache/arrow/issues/36765#issuecomment-1748860688


   Yeah, admittedly pre-buffer was a bit of a hack to minimize the changes to 
the Parquet reader. Ideally you want the Parquet reader to batch its I/O calls 
(as pre-buffer does) without necessarily caching them. But from what I 
remember, the reader is not designed that way: selecting columns eventually 
leads to a lot of disparate I/O calls far down the stack, and you'd have to do 
a bunch of work to untangle that, so caching was the easiest option. That's 
also why the cache doesn't release memory when a read is done: from this level 
it's hard to tell when that point is.
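
To make the "batch the I/O calls without caching them" idea concrete, here is a
minimal sketch in plain Python. It is not Arrow's implementation; the function
names and the `hole_size_limit` / `range_size_limit` parameters are hypothetical
and chosen only for illustration. The sketch merges nearby byte ranges (think:
the column chunks selected by a projection) into fewer, larger reads, hands each
caller its slice, and then drops the buffer instead of holding it in a cache.

```python
import io
from typing import BinaryIO, Dict, List, Tuple

ByteRange = Tuple[int, int]  # (offset, length)


def coalesce_ranges(
    ranges: List[ByteRange],
    hole_size_limit: int = 8192,                 # hypothetical: max gap to read through
    range_size_limit: int = 32 * 1024 * 1024,    # hypothetical: max size of one merged read
) -> List[ByteRange]:
    """Merge nearby byte ranges so that fewer, larger reads are issued."""
    merged: List[Tuple[int, int]] = []  # stored as (start, end)
    for offset, length in sorted(ranges):
        if length <= 0:
            continue  # nothing to read for empty ranges
        end = offset + length
        if merged:
            cur_start, cur_end = merged[-1]
            gap = offset - cur_end  # negative when ranges overlap
            new_end = max(cur_end, end)
            if gap <= hole_size_limit and new_end - cur_start <= range_size_limit:
                merged[-1] = (cur_start, new_end)
                continue
        merged.append((offset, end))
    return [(start, end - start) for start, end in merged]


def read_batched(
    f: BinaryIO, ranges: List[ByteRange], **limits: int
) -> Dict[ByteRange, bytes]:
    """Issue one read per merged range and slice out each requested range.

    Each merged buffer goes out of scope after slicing, so nothing is
    retained beyond the bytes that were actually requested.
    """
    out: Dict[ByteRange, bytes] = {}
    for start, length in coalesce_ranges(ranges, **limits):
        f.seek(start)
        buf = f.read(length)
        for off, ln in ranges:
            if ln > 0 and start <= off and off + ln <= start + length:
                out[(off, ln)] = buf[off - start : off - start + ln]
    return out


if __name__ == "__main__":
    data = bytes(range(256)) * 2
    wanted = [(0, 4), (10, 4), (400, 8)]
    # The first two ranges are 6 bytes apart, so they share one read.
    print(coalesce_ranges(wanted, hole_size_limit=16))
    print(read_batched(io.BytesIO(data), wanted, hole_size_limit=16))
```

The hard part described above is not the merging itself but knowing, at the
point where the reads are issued, the full set of ranges up front and when each
buffer is no longer needed; this sketch sidesteps that by taking all ranges in
one call.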


