Dandandan commented on issue #3632:
URL: https://github.com/apache/arrow-rs/issues/3632#issuecomment-1408595975

   Hi @tustvold, thanks for the response!
   
   This would be for using the Parquet format.
   
   We (Coralogix) were seeing slower response times from queries with 
limits added, which triggered this investigation. No `RowSelection` was present 
initially; we added one on our side (or a `RowFilter`), and that does indeed 
largely "fix" the issue.
   
   Data is in the ~1GB-per-file regime. For what it's worth, I often see 
somewhat better latency than 100ms on S3 - it can be as low as 30-40ms for 
smaller GET requests.
   
   Some directions where I still see benefits in exposing a streaming result, 
rather than collecting to bytes before processing:
   
   * Reading very large files (without limits). Currently this uses more memory 
than necessary compared to processing the data more incrementally.
   * Applying limits during row-predicate evaluation as well. Currently 
evaluation still requires fetching large parts of the data, while a small 
subset might be enough to evaluate the predicate.
   * Concurrent IO / decode with IO prefetching. Currently we have to wait for 
the data to be available, which limits throughput and increases the latency to 
the first results (a sketch of this pattern follows the list).
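
   On that last point, this is roughly the prefetching pattern I have in mind. 
A minimal sketch, not parquet-specific: `fetch` and `decode` are placeholders 
standing in for an object-store GET and batch decoding:

```rust
use bytes::Bytes;
use futures::stream::{self, StreamExt};

// Placeholder for an object-store GET (e.g. an S3 request).
async fn fetch(path: String) -> Bytes {
    let _ = path;
    Bytes::new()
}

// Placeholder for CPU-bound decode of the fetched bytes.
fn decode(bytes: Bytes) {
    let _ = bytes;
}

#[tokio::main]
async fn main() {
    let paths: Vec<String> = (0..16).map(|i| format!("part-{i}.parquet")).collect();

    // Keep up to 4 GETs in flight; `buffered` yields results in order, so
    // while file N is being decoded the fetches for N+1.. are already
    // running, hiding IO latency behind decode instead of waiting serially.
    let mut fetched = stream::iter(paths).map(fetch).buffered(4);

    while let Some(bytes) = fetched.next().await {
        decode(bytes);
    }
}
```

   With the current interface we can only start decoding once a fetched range 
has been collected to `Bytes`; a streaming result would let decode overlap each 
individual request as well.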

