Dandandan commented on issue #3632: URL: https://github.com/apache/arrow-rs/issues/3632#issuecomment-1408595975
Hi @tustvold, thanks for the response! This is for the Parquet format. We (Coralogix) were seeing slower response times on queries with limits added, which triggered this investigation. No `RowSelection` was present; adding one (or a `RowFilter`) on our side does seem to largely "fix" the issue (sketched below). Data is in the ~1GB-per-file regime. I believe I often see a bit better latency than 100ms on S3 btw; it can be as low as 30-40ms for smaller GET requests.

Some directions where I still see benefits in exposing a streaming result, rather than collecting to bytes before being able to process:

* Reading very large files (without limits). This currently uses more memory than necessary; the data could be processed more incrementally.
* Applying limits during row predicate evaluation as well. This currently still requires fetching large parts of the data, while a small subset might suffice to evaluate the predicate.
* Concurrent IO / decode with IO prefetching. Currently we have to wait for the data to be available, which limits throughput and increases the latency to the first results.
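For reference, a minimal sketch of the `RowSelection` workaround mentioned above, using the async Parquet reader (this assumes the `parquet` crate with the `async` feature enabled; the file path and row ranges are illustrative, not from our actual workload):

```rust
use futures::TryStreamExt;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet").await?;

    // Only decode rows 1000..1100; the reader can avoid decoding
    // (and, with a page index, fetching) pages outside the selection.
    let selection = RowSelection::from(vec![
        RowSelector::skip(1000),
        RowSelector::select(100),
    ]);

    let mut stream = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .with_row_selection(selection)
        .with_batch_size(1024)
        .build()?;

    // Batches arrive incrementally rather than only after the whole
    // file has been collected into memory.
    while let Some(batch) = stream.try_next().await? {
        println!("decoded {} rows", batch.num_rows());
    }
    Ok(())
}
```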
