tustvold commented on issue #3632:
URL: https://github.com/apache/arrow-rs/issues/3632#issuecomment-1408608866

   > Currently this uses more memory than necessary if data would be processed more incrementally
   
   It fetches a row group at a time; there isn't a smaller horizontal unit of IO that could be used, nor would using one actually be a good idea. Small fetch requests destroy performance. Avoiding them is the major thing that differentiates Databricks' closed-source S3 parquet reader from the open-source arrow one, and it gives it significantly better performance.
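
   To illustrate the idea (not the crate's actual code), here is a minimal sketch of coalescing the column-chunk byte ranges of a row group so they can be fetched with a few large requests rather than many small ones; the helper name `coalesce_ranges` and the gap threshold are hypothetical:

```rust
use std::ops::Range;

/// Hypothetical helper: merge byte ranges that are close together so a row
/// group's column chunks can be fetched with a handful of large requests
/// instead of many small ones.
fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            // Extend the previous range if the gap to this one is small enough
            Some(prev) if r.start <= prev.end + max_gap => {
                prev.end = prev.end.max(r.end);
            }
            _ => merged.push(r),
        }
    }
    merged
}

fn main() {
    // Byte ranges of three column chunks within one row group
    let chunks = vec![0..1_000, 1_100..2_000, 50_000..60_000];
    // With a 1 KiB gap tolerance the first two chunks collapse into one fetch
    assert_eq!(coalesce_ranges(chunks, 1024), vec![0..2_000, 50_000..60_000]);
    println!("coalesced ranges computed");
}
```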
   
   > Apply limits for row predicate evaluation as well.
   
   RowSelection already provides this and more
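
   For reference, a sketch of how an offset/limit can be expressed with the existing RowSelection machinery, assuming the parquet crate with its arrow feature enabled (the offset and limit values here are purely illustrative):

```rust
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // Express "skip the first 100 rows, then read at most 10" as a RowSelection;
    // rows not covered by a select are not decoded.
    let selectors = vec![RowSelector::skip(100), RowSelector::select(10)];
    let selection = RowSelection::from(selectors);

    // The selection can then be attached to a reader via
    // ArrowReaderBuilder::with_row_selection(selection), so predicate
    // evaluation only considers the selected rows.
    let _ = selection;
}
```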
   
   > Concurrent IO / decode with IO prefetching
   
   The parquet reader would not be able to make use of this stream abstraction to achieve this, as it relies on reading contiguous byte arrays and slicing them up for decode. Concurrent IO is definitely possible; an earlier version did exactly that, but it was removed mainly because it added additional memory pressure.
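
   As a rough sketch of the kind of IO/decode overlap being described (not the crate's actual implementation): while row group N is being decoded, the fetch for row group N+1 is already in flight, at the cost of holding up to two row groups' worth of bytes in memory. `fetch_row_group` and `decode_row_group` are hypothetical stand-ins, and tokio (default features) is assumed purely for illustration:

```rust
use std::time::Duration;

// Hypothetical stand-ins: in a real reader these would be object-store
// requests and parquet column decoding.
async fn fetch_row_group(idx: usize) -> Vec<u8> {
    tokio::time::sleep(Duration::from_millis(50)).await; // simulated IO latency
    vec![idx as u8; 16]
}

fn decode_row_group(bytes: &[u8]) -> usize {
    bytes.len() // stand-in for CPU-bound decode work
}

#[tokio::main]
async fn main() {
    let num_row_groups = 4;

    // Start fetching row group 0 before entering the decode loop.
    let mut pending = Some(tokio::spawn(fetch_row_group(0)));

    for idx in 0..num_row_groups {
        // Wait for the bytes of the current row group...
        let bytes = pending
            .take()
            .expect("a fetch is always in flight here")
            .await
            .expect("fetch task panicked");

        // ...and immediately kick off the fetch for the next one so its IO
        // overlaps with decoding the current one. This is where the extra
        // memory pressure comes from: two row groups are buffered at once.
        if idx + 1 < num_row_groups {
            pending = Some(tokio::spawn(fetch_row_group(idx + 1)));
        }

        let rows = decode_row_group(&bytes);
        println!("decoded row group {idx}: {rows} bytes");
    }
}
```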
   
   

