tustvold commented on issue #3632: https://github.com/apache/arrow-rs/issues/3632#issuecomment-1408608866
> Currently this uses more memory than necessary if data would be processed more incrementally

It fetches a row group at a time; there isn't a smaller horizontal unit of IO that could be used, nor would using one actually be a good idea, as small fetch requests destroy performance. This is the major thing that differentiates Databricks' closed-source S3 parquet reader from the open-source arrow one, and gives it significantly better performance.

> Apply limits for row predicate evaluation as well.

`RowSelection` already provides this and more (see the sketch below).

> Concurrent IO / decode with IO prefetching

The parquet reader would not be able to use this stream abstraction to achieve that, as it relies on reading contiguous byte arrays and slicing them up for decode. Concurrent IO is definitely possible; an earlier version did this, however, and it was removed mainly because it added additional memory pressure.
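For context, a minimal sketch of how a limit can be pushed down through `RowSelection` with the async reader; the file path, row counts, and error handling here are illustrative assumptions, not part of this issue:

```rust
use futures::TryStreamExt;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder path.
    let file = File::open("data.parquet").await?;

    let builder = ParquetRecordBatchStreamBuilder::new(file).await?;

    // Determine the total row count from the footer metadata so the
    // selection covers the whole file: select the first `limit` rows,
    // skip the rest.
    let total_rows = builder.metadata().file_metadata().num_rows() as usize;
    let limit = 1024.min(total_rows);
    let selection = RowSelection::from(vec![
        RowSelector::select(limit),
        RowSelector::skip(total_rows - limit),
    ]);

    // Rows outside the selection are never decoded; the same selection
    // also bounds what any attached RowFilter would need to evaluate.
    let stream = builder.with_row_selection(selection).build()?;

    let batches = stream.try_collect::<Vec<_>>().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```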
