pmarks opened a new issue, #9381: URL: https://github.com/apache/arrow-rs/issues/9381
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I want to issue many parallel fetch requests to the underlying object store when reading a file with many small row groups. This matters for few-column queries against Parquet files with modest-sized row groups on high-latency object storage such as S3 and R2.

Do people think this is a problem worth solving? Any suggestions on what a good API would look like? I’m going to take a crack at making something work, just to explore the space, but would appreciate any input.

**Describe the solution you'd like**

At a very high level, the ideal interface would be `ParquetRecordBatchStream` or similar, but with a configurable number of parallel read requests.

**Describe alternatives you've considered**

I don't have any good ideas for how to get IO parallelism with the current types. The sequential nature of row group processing is fairly deeply baked into the state-machine architecture.

There are some related issues and PRs that touch on this, but having IO for multiple row groups in flight at the same time still appears to be unsupported:

- https://github.com/apache/arrow-rs/issues/5522
- https://github.com/apache/datafusion/pull/18391
- https://github.com/apache/arrow-rs/issues/7983
- https://github.com/apache/arrow-rs/issues/5141
- https://github.com/apache/arrow-rs/pull/6907

**Additional context**

For example, I have a Parquet file where I need to make ~1k reads of 250kB each to read a particular column. If we assume a per-request latency of 70ms (as observed for R2 in various benchmarks) and 25MB/s of throughput, then making serial requests will take 1k * 70ms + 1k * 250kB/(25MB/s) = 70s (latency) + 10s (data transfer) = 80s. S3 and R2 scale to many parallel GET requests, letting us hide much of the per-request latency if we can parallelize the requests.
In a browser I can make 6 parallel requests, so we’d expect the total time to come down to ~70s/6 + 10s ≈ 22s for my particular use case of in-browser Parquet visualization.
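
To make the estimate above easy to re-run with other numbers, here is a minimal sketch of the same back-of-the-envelope model in Rust. The function `estimate_fetch_seconds` and its parameters are hypothetical names introduced for illustration; they are not part of arrow-rs or any object store client. The model assumes, as the estimate above does, that per-request latency overlaps perfectly across in-flight requests while total bandwidth is shared, so only the latency term shrinks with parallelism.

```rust
/// Hypothetical cost model (not an arrow-rs API): estimated wall-clock
/// seconds to fetch `requests` byte ranges of `bytes_per_request` bytes each
/// with `parallelism` requests in flight at once.
///
/// Assumptions: latency overlaps perfectly across in-flight requests, and
/// total bandwidth is shared, so the transfer term does not shrink.
pub fn estimate_fetch_seconds(
    requests: u64,
    bytes_per_request: u64,
    latency_secs: f64,
    bandwidth_bytes_per_sec: f64,
    parallelism: u64,
) -> f64 {
    assert!(parallelism > 0, "parallelism must be at least 1");
    // Latency term: total per-request latency, divided across parallel slots.
    let latency_total = requests as f64 * latency_secs / parallelism as f64;
    // Transfer term: total bytes over the shared link bandwidth.
    let transfer_total = (requests * bytes_per_request) as f64 / bandwidth_bytes_per_sec;
    latency_total + transfer_total
}
```

With the numbers from this issue, `estimate_fetch_seconds(1_000, 250_000, 0.070, 25_000_000.0, 1)` gives roughly 80s for serial requests, and the same call with `parallelism = 6` gives roughly 21.7s. Real object stores will deviate from this (request waves, throttling, per-connection bandwidth limits), so treat it as a rough bound rather than a prediction.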
