Re: [I] Parallel Parquet Reading [arrow-rs]

via GitHub Tue, 10 Feb 2026 14:13:36 -0800


alamb commented on issue #9381:
URL: https://github.com/apache/arrow-rs/issues/9381#issuecomment-3881023566


   > Do people think this is a problem worth solving? Any suggestions on what a 
good API or implementation would look like? I’m going to take a crack at making 
something work, just to explore the space, but would appreciate any input.
   
   What you can do with the APIs today is create multiple 
ParquetRecordBatchStreams (one stream for each row group, for example) and run 
those streams in parallel. At a high level, this is what DataFusion does to 
parallelize reads from a Parquet file: it creates independent readers.
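
   A minimal sketch of the planning step, assuming you want a fixed number of 
streams rather than strictly one per row group. The helper name 
`partition_row_groups` is hypothetical and not part of arrow-rs; each returned 
list is meant to be passed to `with_row_groups` on its own 
`ParquetRecordBatchStreamBuilder`, and the resulting streams can then be polled 
concurrently.

```rust
/// Split row group indices `0..num_row_groups` into at most `num_streams`
/// contiguous, near-equal chunks. Hypothetical helper, not an arrow-rs API:
/// each chunk would be handed to `with_row_groups` on a separate
/// `ParquetRecordBatchStreamBuilder`.
fn partition_row_groups(num_row_groups: usize, num_streams: usize) -> Vec<Vec<usize>> {
    if num_row_groups == 0 || num_streams == 0 {
        return Vec::new();
    }
    // Never create more streams than there are row groups.
    let n = num_streams.min(num_row_groups);
    let base = num_row_groups / n;
    let extra = num_row_groups % n;

    let mut parts = Vec::with_capacity(n);
    let mut start = 0;
    for i in 0..n {
        // The first `extra` chunks take one additional row group.
        let len = base + usize::from(i < extra);
        parts.push((start..start + len).collect());
        start += len;
    }
    parts
}
```

   With tiny row groups, grouping several per stream keeps the number of 
independent readers bounded while still reading in parallel.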
   
   The downside of multiple streams is that each stream will buffer an entire 
row group and thus require more memory.
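
   As a rough illustration of that cost, here is a sketch that bounds the 
buffered bytes when several streams each hold one full row group at once. The 
function name and the one-row-group-per-stream model are assumptions made for 
this example; the sizes would come from the file's row group metadata, and real 
usage also includes decode overhead that is ignored here.

```rust
/// Rough upper bound on buffered bytes when `concurrent_streams` streams each
/// hold one whole row group at the same time. Assumes the worst case, where
/// the largest row groups are all in flight together.
fn worst_case_buffered_bytes(row_group_sizes: &[u64], concurrent_streams: usize) -> u64 {
    let mut sizes = row_group_sizes.to_vec();
    // Largest row groups first.
    sizes.sort_unstable_by(|a, b| b.cmp(a));
    sizes.iter().take(concurrent_streams).sum()
}
```

   For many tiny row groups this bound stays small even at high concurrency, 
while for large row groups it grows roughly linearly with the number of streams.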
   
   In your use case of tiny row groups, that is probably a good tradeoff, but in 
the general case (like DataFusion) it is not always clear that multiple 
concurrent requests are a good idea.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
