[I] Parallel Parquet Reading [arrow-rs]

via GitHub Mon, 09 Feb 2026 06:32:48 -0800


pmarks opened a new issue, #9381:
URL: https://github.com/apache/arrow-rs/issues/9381


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I want to make many parallel data fetch requests to the underlying object 
store when fetching data with many small row groups. 
   
   This is relevant for few-column queries against parquet files with 
modest-sized row groups on high-latency object storage like S3 and R2.
   
   Do people think this is a problem worth solving? Any suggestions on what a 
good API would look like? I’m going to take a crack at making something work, 
just to explore the space, but would appreciate any input.
   
   
   **Describe the solution you'd like**
   At a super high level the ideal interface would be ParquetRecordBatchStream 
or similar, but where I can configure the number of parallel read requests to 
generate.
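
   To make the "configurable number of parallel read requests" idea concrete, 
here is a minimal, stdlib-only sketch of bounded-concurrency range fetching. 
`fetch_ranges_bounded`, `max_in_flight`, and the `fetch` closure are 
hypothetical names for illustration, not arrow-rs APIs, and OS threads stand in 
for whatever would actually drive the IO. An async reader would more likely use 
something like `buffered(n)` from the `futures` crate's `StreamExt`, and 
threads are generally not an option in a browser.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper (not an arrow-rs API): call `fetch(start, end)` for
/// every byte range with at most `max_in_flight` calls running at once, and
/// return the results in the same order as `ranges`.
fn fetch_ranges_bounded<T, F>(ranges: &[(u64, u64)], max_in_flight: usize, fetch: F) -> Vec<T>
where
    T: Send,
    F: Fn(u64, u64) -> T + Sync,
{
    // Index of the next range that no worker has claimed yet.
    let next = AtomicUsize::new(0);
    // One slot per range so results can be written out of order.
    let results: Mutex<Vec<Option<T>>> = Mutex::new((0..ranges.len()).map(|_| None).collect());
    // Never spawn more workers than there are ranges, and always at least one.
    let workers = max_in_flight.clamp(1, ranges.len().max(1));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(&(start, end)) = ranges.get(i) else { break };
                // The (blocking) request runs outside the lock.
                let value = fetch(start, end);
                let mut slots = results.lock().unwrap();
                slots[i] = Some(value);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every range is fetched exactly once"))
        .collect()
}
```

   A caller would pass the column-chunk byte ranges for the row groups it 
needs, and a closure that issues one GET per range. Each worker claims the 
next unfetched range as soon as it finishes its current one, so up to 
`max_in_flight` requests stay in flight until the ranges run out.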
   
   **Describe alternatives you've considered**
   I don't have any good ideas for how to get IO parallelism with the current 
types. The sequential nature of row group processing is fairly deeply baked 
into the state-machine architecture.
   
   There are some related issues that touch on this, but the capability of 
having IO for multiple row groups in flight at the same time still appears to 
be unsupported:
   https://github.com/apache/arrow-rs/issues/5522
   https://github.com/apache/datafusion/pull/18391
   https://github.com/apache/arrow-rs/issues/7983
   https://github.com/apache/arrow-rs/issues/5141
   https://github.com/apache/arrow-rs/pull/6907
   
   **Additional context**
   For example, I have a parquet file where I need to make ~1k reads of 250kB 
to read a particular column. If we assume that the per-request latency of the 
object store is 70ms (as observed for R2 in various benchmarks) and we get 
25MB/s of throughput, then making serial requests will take 1k * 70ms + 1k * 
250kB/(25MB/s) = 70s (latency) + 10s (data transfer) = 80s. S3 and R2 scale to 
many parallel GET requests, letting us hide much of the per-request latency if 
we can parallelize the requests. In a browser I can make 6 parallel requests, 
so we’d expect the total time to come down to ~70s/6 + 10s ≈ 22s for my 
particular use case of in-browser parquet viz.
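
   The back-of-the-envelope estimate above can be written as a small function. 
This is only a sketch of the model used in the example, not a measurement: it 
assumes per-request latency overlaps perfectly across the in-flight requests 
while total bandwidth is shared, so the transfer term does not shrink. The 
function and parameter names are made up for illustration.

```rust
/// Estimated wall-clock seconds to fetch `requests` ranges of
/// `bytes_per_request` bytes each.
///
/// Simplifying assumptions: latency is divided evenly across `parallelism`
/// in-flight requests, and bandwidth is a shared cap, so the data-transfer
/// time is the same regardless of parallelism.
fn estimated_fetch_secs(
    requests: u32,
    bytes_per_request: f64,
    latency_secs: f64,
    bandwidth_bytes_per_sec: f64,
    parallelism: u32,
) -> f64 {
    // Treat a parallelism of 0 as 1 to avoid dividing by zero.
    let in_flight = f64::from(parallelism.max(1));
    let latency = f64::from(requests) * latency_secs / in_flight;
    let transfer = f64::from(requests) * bytes_per_request / bandwidth_bytes_per_sec;
    latency + transfer
}
```

   With the numbers from the example (1000 requests, 250e3 bytes each, 0.07 s 
latency, 25e6 bytes/s), this gives 80 s for `parallelism = 1` and about 21.7 s 
for `parallelism = 6`.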


