xmakro opened a new issue, #5343:
URL: https://github.com/apache/arrow-rs/issues/5343

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   I'd like to read row groups from a Parquet file with random access, ideally 
through both the async and sync APIs. Currently, this is only possible if all 
row groups are known when constructing the `ParquetRecordBatchStreamBuilder`. 
In my use case, however, row groups are fetched with random access during the 
lifetime of the program, and the row group indices are not known ahead of time.
   
   **Describe alternatives you've considered**
   
   Currently, I'm constructing a new `ParquetRecordBatchStreamBuilder` for 
every row group that I'm reading. This has high overhead, as it needs to reopen 
the file and re-read the metadata each time.
   
   **Describe the solution you'd like**
   
   I'm agnostic to the API, but I found the modular API from `arrow2` works 
very well for this use case. Maybe something like:
   
   ```rust
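   // Proposed API sketch, modeled on `arrow2`.
   // `reader` must be mutable because it is borrowed as `&mut` below.
   let mut reader = tokio::fs::File::open(...).await?;
   let metadata = read_metadata(&mut reader).await?;
   let row_group: RecordBatch =
       read_row_group(&mut reader, &metadata, row_group_index, Some(projection)).await?;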
   ```
   
   An alternative would be to add a method that returns the metadata and file 
pointer when finishing a `ParquetRecordBatchStreamBuilder`, and then to provide 
a constructor for `ParquetRecordBatchStream` that accepts the metadata along 
with the file, so that the metadata doesn't have to be re-read. This feels 
more cumbersome.
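
To make the requested access pattern concrete, here is a minimal, std-only sketch of "read the metadata once, then fetch row groups by index on demand". It assumes a made-up toy file layout (payloads followed by a footer of offsets and lengths), not the real Parquet format, and it is synchronous only. `RowGroupMeta`, `build_toy_file`, and the toy `read_metadata` / `read_row_group` are hypothetical names that mirror the shape of the proposed API; none of them are arrow-rs functions.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Byte range of one row group in the toy file (stand-in for real Parquet metadata).
#[derive(Debug, Clone, Copy, PartialEq)]
struct RowGroupMeta {
    offset: u64,
    len: u64,
}

/// Read one little-endian u64 from the current position.
fn read_u64<R: Read>(reader: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    reader.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

/// Parse the toy footer once: `count` (offset, len) pairs, then a trailing u32 count.
fn read_metadata<R: Read + Seek>(reader: &mut R) -> io::Result<Vec<RowGroupMeta>> {
    reader.seek(SeekFrom::End(-4))?;
    let mut count_buf = [0u8; 4];
    reader.read_exact(&mut count_buf)?;
    let count = u32::from_le_bytes(count_buf) as i64;

    // Each footer entry is 16 bytes and sits directly before the count.
    reader.seek(SeekFrom::End(-4 - 16 * count))?;
    let mut metadata = Vec::with_capacity(count as usize);
    for _ in 0..count {
        let offset = read_u64(reader)?;
        let len = read_u64(reader)?;
        metadata.push(RowGroupMeta { offset, len });
    }
    Ok(metadata)
}

/// Fetch a single row group by index, reusing metadata that was parsed earlier.
fn read_row_group<R: Read + Seek>(
    reader: &mut R,
    metadata: &[RowGroupMeta],
    index: usize,
) -> io::Result<Vec<u8>> {
    let meta = metadata.get(index).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "row group index out of range")
    })?;
    reader.seek(SeekFrom::Start(meta.offset))?;
    let mut data = vec![0u8; meta.len as usize];
    reader.read_exact(&mut data)?;
    Ok(data)
}

/// Build an in-memory toy file: payloads first, then the footer described above.
fn build_toy_file(groups: &[&str]) -> Vec<u8> {
    let mut file = Vec::new();
    let mut footer = Vec::new();
    for group in groups {
        footer.extend_from_slice(&(file.len() as u64).to_le_bytes());
        footer.extend_from_slice(&(group.len() as u64).to_le_bytes());
        file.extend_from_slice(group.as_bytes());
    }
    file.extend_from_slice(&footer);
    file.extend_from_slice(&(groups.len() as u32).to_le_bytes());
    file
}

fn main() -> io::Result<()> {
    let mut reader = Cursor::new(build_toy_file(&["alpha", "bravo-bravo", "c"]));
    // The metadata is parsed exactly once...
    let metadata = read_metadata(&mut reader)?;
    // ...and row groups are then fetched in an order decided at run time.
    for index in [2, 0, 1, 0] {
        let data = read_row_group(&mut reader, &metadata, index)?;
        println!("row group {}: {}", index, String::from_utf8_lossy(&data));
    }
    Ok(())
}
```

The point of the sketch is the separation of concerns: the reader and the parsed metadata are ordinary values owned by the caller, so any row group can be requested at any time without reopening the file or re-reading the footer.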

