xmakro opened a new issue, #5343: URL: https://github.com/apache/arrow-rs/issues/5343
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I'd like to read row groups from a Parquet file with random access, ideally through both the async and sync APIs. Currently, this is only possible if all row groups are known when constructing the `ParquetRecordBatchStreamBuilder`. In my use case, however, row groups are fetched with random access during the lifetime of the program, and the row group indices are not known ahead of time.

**Describe alternatives you've considered**

Currently, I'm constructing a new `ParquetRecordBatchStreamBuilder` for every row group that I read. This has high overhead, as it needs to reopen the file and re-read the metadata each time.

**Describe the solution you'd like**

I'm agnostic to the API, but I found that the modular API from `arrow2` works very well for this use case. Maybe something like:

```rust
// Proposed API shape; `read_metadata` and `read_row_group` do not exist yet.
let mut reader = tokio::fs::File::open(...).await?;
// Read the file metadata once and reuse it for every row group.
let metadata = read_metadata(&mut reader).await?;
let row_group: RecordBatch = read_row_group(&mut reader, &metadata, row_group_index, Some(projection)).await?;
```

An alternative would be to add a method that returns the metadata and file handle when finishing a `ParquetRecordBatchStreamBuilder`, and then to provide a constructor for `ParquetRecordBatchStream` that accepts the metadata along with the file, so that the metadata doesn't have to be re-read. This feels more cumbersome.
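To make the requested access pattern concrete, here is a minimal synchronous sketch of the "read metadata once, then fetch row groups by index" shape. It is not Parquet and not an arrow-rs API: the toy file layout, the `Metadata` struct, and the `write_toy_file`, `read_metadata`, and `read_row_group` functions are all hypothetical, use only the standard library, and return raw bytes rather than a `RecordBatch`.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical stand-in for file metadata: where each row group lives.
#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    /// (byte offset, byte length) of each row group.
    row_groups: Vec<(u64, u64)>,
}

/// Toy layout (NOT Parquet): row group bytes, then one (offset, len) pair
/// of little-endian u64s per row group, then the row group count as a u64.
fn write_toy_file(groups: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut index = Vec::new();
    for g in groups {
        index.push((buf.len() as u64, g.len() as u64));
        buf.extend_from_slice(g.as_bytes());
    }
    for (off, len) in &index {
        buf.extend_from_slice(&off.to_le_bytes());
        buf.extend_from_slice(&len.to_le_bytes());
    }
    buf.extend_from_slice(&(index.len() as u64).to_le_bytes());
    buf
}

fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut b = [0u8; 8];
    r.read_exact(&mut b)?;
    Ok(u64::from_le_bytes(b))
}

/// Read the footer once; the result is reused for every later read.
fn read_metadata<R: Read + Seek>(reader: &mut R) -> io::Result<Metadata> {
    reader.seek(SeekFrom::End(-8))?;
    let count = read_u64(reader)?;
    reader.seek(SeekFrom::End(-8 - 16 * count as i64))?;
    let mut row_groups = Vec::with_capacity(count as usize);
    for _ in 0..count {
        let off = read_u64(reader)?;
        let len = read_u64(reader)?;
        row_groups.push((off, len));
    }
    Ok(Metadata { row_groups })
}

/// Fetch a single row group by index without touching the footer again.
fn read_row_group<R: Read + Seek>(
    reader: &mut R,
    metadata: &Metadata,
    index: usize,
) -> io::Result<Vec<u8>> {
    let &(off, len) = metadata.row_groups.get(index).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "row group index out of range")
    })?;
    reader.seek(SeekFrom::Start(off))?;
    let mut out = vec![0u8; len as usize];
    reader.read_exact(&mut out)?;
    Ok(out)
}

fn main() -> io::Result<()> {
    let mut reader = Cursor::new(write_toy_file(&["alpha", "beta", "gamma"]));
    let metadata = read_metadata(&mut reader)?;
    // Indices arrive in arbitrary order over the lifetime of the program.
    for index in [2, 0, 1] {
        let bytes = read_row_group(&mut reader, &metadata, index)?;
        println!("row group {index}: {}", String::from_utf8_lossy(&bytes));
    }
    Ok(())
}
```

The point of the split is that the reader and the metadata are owned by the caller, so the cost of parsing the footer is paid once regardless of how many row groups are requested later or in what order.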
