[I] Support returning RowGroupIndex as a column from parquet reader [arrow-rs]

via GitHub Fri, 17 Oct 2025 03:49:32 -0700


alamb opened a new issue, #8641:
URL: https://github.com/apache/arrow-rs/issues/8641


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Based on a [discord 
thread](https://discord.com/channels/885562378132000778/1427661435210825880/1428096622193021119)
   
   
   @adriangb suggests it would be very useful to be able to get row group 
indexes out of a parquet scan, for example to build secondary indexes at the 
row group level.
   
   For example, it would be really nice to compute per-row-group statistics to 
create a secondary index with a query like this:
   
   ```sql
   select 
     row_group_id,  -- would like to get the row group id from the reader
     min(col), 
     max(col) 
   from t
   group by
    row_group_id
   ```
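
To make the intent of the query concrete, here is a minimal sketch of the same aggregation in plain Rust. It assumes the reader has already produced a row group id for every row, which is exactly the feature being requested; the function name `min_max_by_row_group` and the use of `i64` values are illustrative assumptions, not part of any existing API.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: per-row-group (min, max) of `values`.
/// `row_group_ids[i]` is the row group that row `i` was read from.
fn min_max_by_row_group(
    row_group_ids: &[u64],
    values: &[i64],
) -> BTreeMap<u64, (i64, i64)> {
    assert_eq!(row_group_ids.len(), values.len());
    let mut index = BTreeMap::new();
    for (&rg, &v) in row_group_ids.iter().zip(values) {
        index
            .entry(rg)
            // Row group already seen: widen its (min, max) range.
            .and_modify(|(lo, hi): &mut (i64, i64)| {
                *lo = (*lo).min(v);
                *hi = (*hi).max(v);
            })
            // First row of this row group: both bounds start at `v`.
            .or_insert((v, v));
    }
    index
}
```

The resulting map has one entry per row group, which is the shape a row-group-level secondary index would need.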
   
   
   You can currently specify row groups one at a time with 
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/type.ParquetRecordBatchReaderBuilder.html#method.with_row_groups
   
   
   
   **Describe the solution you'd like**
   I would like a way to request a "row group" column from the reader. If a 
file contained columns `A` and `B`, with this feature the reader would also 
produce a new column `row_group_index`:
   
   | A | B | `row_group_index` |
   |--------|--------|--------|
   | 10 | v | 0 (first row group) |
   | 20 | d | 0 |
   | 40 | a | 0 |
   | ... | ... | ... |
   | 90 | z | 1 (second row group) |
   | 50 | f | 1 | 
   | ... | ... | ... |
   | 10 | b | 2 (third row group) |
   | 60 | a | 2 | 
   | ... | ... | ... |
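
As a rough illustration of what the values in such a column would be, the sketch below expands per-row-group row counts into one index per row. It assumes every row of every row group is read in file order, with no row selection or filtering; the function name `row_group_index_column` is hypothetical and this is not a proposal for the reader API itself.

```rust
/// Hypothetical helper: expand per-row-group row counts into one
/// row group index per row, in file order.
fn row_group_index_column(rows_per_group: &[usize]) -> Vec<u64> {
    rows_per_group
        .iter()
        .enumerate()
        // Row group `idx` contributes `n` copies of its own index.
        .flat_map(|(idx, &n)| std::iter::repeat(idx as u64).take(n))
        .collect()
}
```

With row counts of 3, 2 and 2, this yields three rows tagged 0, then two tagged 1, then two tagged 2. An empty row group contributes no rows, but later row groups keep their original index.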
   
   
   
   
   
   **Describe alternatives you've considered**
   The hardest part of this issue is figuring out what the API to request this 
extra column should look like.
   
   **Additional context**
   Very similar to the request to add row numbers:
    - https://github.com/apache/arrow-rs/issues/7299
   


