alamb commented on issue #9381: URL: https://github.com/apache/arrow-rs/issues/9381#issuecomment-3881023566
> Do people think this is a problem worth solving? Any suggestions on what a good API or implementation would look like? I’m going to take a crack at making something work, just to explore the space, but would appreciate any input.

What you can do with the APIs today is create multiple ParquetRecordBatchStreams (one stream per row group, for example) and run those streams in parallel. At a high level, this is what DataFusion does to parallelize reads from a parquet file: it creates independent readers.

The downside of multiple streams is that each stream will buffer an entire row group and thus require more memory.

In your use case of tiny row groups, that is probably a good tradeoff, but in the general case (like DataFusion) it is not always clear that multiple concurrent requests are a good idea.
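
To put rough numbers on the memory tradeoff described above, here is a back-of-the-envelope sketch in Rust. It is not part of the arrow-rs API: the function `peak_buffered_bytes` and the row group sizes are hypothetical, and it assumes the simplest possible model, in which every active stream holds exactly one whole row group in memory and nothing else.

```rust
/// Worst-case number of bytes buffered at once when up to `concurrency`
/// streams are active and each one holds a whole row group in memory.
/// Hypothetical helper for illustration only, not an arrow-rs function.
fn peak_buffered_bytes(row_group_bytes: &[u64], concurrency: usize) -> u64 {
    let mut sizes = row_group_bytes.to_vec();
    // Sort largest first: the worst case is the biggest row groups
    // all being in flight at the same time.
    sizes.sort_unstable_by(|a, b| b.cmp(a));
    sizes.iter().take(concurrency).sum()
}

fn main() {
    // Assumed sizes, chosen only to contrast the two situations.
    let tiny_row_groups = vec![64 * 1024u64; 1000]; // 1000 row groups of 64 KiB
    let large_row_groups = vec![128 * 1024 * 1024u64; 16]; // 16 row groups of 128 MiB

    let concurrency = 8;
    println!(
        "tiny row groups:  {} bytes buffered at peak",
        peak_buffered_bytes(&tiny_row_groups, concurrency)
    );
    println!(
        "large row groups: {} bytes buffered at peak",
        peak_buffered_bytes(&large_row_groups, concurrency)
    );
}
```

Under these assumed sizes, eight concurrent streams over 64 KiB row groups buffer about 512 KiB in total, while the same eight streams over 128 MiB row groups buffer about 1 GiB. That gap is why one stream per row group is likely a good tradeoff for tiny row groups and a less obvious one in the general case.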
