alamb opened a new issue, #8641: URL: https://github.com/apache/arrow-rs/issues/8641
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Based on a [discord thread](https://discord.com/channels/885562378132000778/1427661435210825880/1428096622193021119), @adriangb suggests it would be very useful to be able to get row group ids out of a parquet scan, for example to build secondary indexes at the row group level.

For example, it would be really nice to compute per-row-group statistics to create a secondary index with a query like this:

```sql
select
  row_group_id, -- <-- would like to get the row group id from the reader
  min(col),
  max(col)
from t
group by row_group_id
```

Currently, you can specify row groups one at a time with https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/type.ParquetRecordBatchReaderBuilder.html#method.with_row_groups

**Describe the solution you'd like**

I would like a way to request a "row group" column from the reader.

If a file contained columns `A` and `B`, with this feature the reader would also produce a new column `row_group_index`:

| A | B | `row_group_index` |
|--------|--------|--------|
| 10 | v | 0 (first row group) |
| 20 | d | 0 |
| 40 | a | 0 |
| ... | ... | ... |
| 90 | z | 1 (second row group) |
| 50 | f | 1 |
| ... | ... | ... |
| 10 | b | 2 (third row group) |
| 60 | a | 2 |
| ... | ... | ... |

**Describe alternatives you've considered**

The hardest part of this issue is figuring out what the API to request this extra column should look like.

**Additional context**

Very similar to the request to add row numbers:
- https://github.com/apache/arrow-rs/issues/7299
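
To make the intended semantics of the requested column concrete, here is a minimal std-only sketch. It does not use the `parquet` crate and is not a proposed API. The helper names `row_group_index_column` and `min_max_by_row_group` are hypothetical, and the row counts `[3, 2, 2]` are an assumption that takes only the rows shown explicitly in the table above (the elided `...` rows are ignored).

```rust
/// Expand per-row-group row counts into one `row_group_index` value per row.
/// (Hypothetical helper; a real reader would derive the counts from file metadata.)
fn row_group_index_column(rows_per_group: &[usize]) -> Vec<u32> {
    rows_per_group
        .iter()
        .enumerate()
        .flat_map(|(idx, &n)| std::iter::repeat(idx as u32).take(n))
        .collect()
}

/// Compute (row_group_index, min, max) per row group, mirroring the
/// `group by row_group_id` query. Assumes `row_group_index` is non-decreasing,
/// which holds when rows are produced in file order.
fn min_max_by_row_group(values: &[i64], row_group_index: &[u32]) -> Vec<(u32, i64, i64)> {
    let mut out: Vec<(u32, i64, i64)> = Vec::new();
    for (&v, &g) in values.iter().zip(row_group_index) {
        if let Some(entry) = out.last_mut() {
            if entry.0 == g {
                // Same row group as the previous row: widen its min/max.
                entry.1 = entry.1.min(v);
                entry.2 = entry.2.max(v);
                continue;
            }
        }
        // First row of a new row group.
        out.push((g, v, v));
    }
    out
}

fn main() {
    // Assumed row counts for the three row groups in the example table.
    let index = row_group_index_column(&[3, 2, 2]);
    // Values of column `A` from the rows shown in the example table.
    let a: [i64; 7] = [10, 20, 40, 90, 50, 10, 60];
    for (g, min, max) in min_max_by_row_group(&a, &index) {
        println!("row_group_index={} min={} max={}", g, min, max);
    }
}
```

With a reader-provided `row_group_index` column, the second helper is all a caller would need to build a per-row-group min/max index. Without it, the caller has to track row group boundaries separately, as the first helper does.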
