jkylling opened a new pull request, #7307: URL: https://github.com/apache/arrow-rs/pull/7307
# Which issue does this PR close?

Closes #7299.

# What changes are included in this PR?

In this PR we:

* Add configuration to the `ArrowReaderBuilder` to set a `row_number_column`, which extends the `RecordBatch`es that are read with an additional column containing file row numbers.
* Keep track of the first row number of each row group in the file. This is computed from the file metadata.
* Add an `ArrayReader` to the vector of `ArrayReader`s that read columns from the Parquet file, if `row_number_column` is set in the reader configuration. This is a `RowNumberReader`, a special `ArrayReader` that reads no data from the Parquet pages and instead uses the first row numbers in the `RowGroupMetaData` to keep track of progress.
* Add some basic tests and fuzz tests of the functionality.

`RowGroupMetaData::first_row_number` is an `Option<i64>`, since the row number may be unknown (I encountered an instance of this when trying to integrate this PR into [delta-rs](https://github.com/delta-io/delta-rs/blob/8acfa3f1c8ec096dc06a492517d82be10cefbc67/crates/core/src/writer/stats.rs#L113)), and `None` is a better way to express that than a special integer value.

# Are there any user-facing changes?

We add one public method:

* `ArrowReaderBuilder::with_row_number_column`

There are a few breaking changes, since we touch some public interfaces:

* `RowGroupMetaData::from_thrift` and `RowGroupMetaData::from_thrift_encrypted` take an additional parameter `first_row_number: Option<i64>`.
* The trait `RowGroups` has an additional method, `RowGroups::row_groups`. This method could potentially replace `RowGroups::num_rows`, or be used to provide a default implementation for it.
* There is an additional error variant, `ParquetError::RowGroupMetaDataMissingRowNumber`.

I'm very open to suggestions on how to reduce the number of breaking changes.
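To illustrate the idea behind the first-row-number bookkeeping and the `RowNumberReader`, here is a minimal, std-only sketch. It is not the code in this PR and does not use the `parquet` crate: `RowGroupInfo`, `with_first_row_numbers`, and `RowNumberSketch` are hypothetical names made up for this example. It assumes that a row group's first row number is the sum of the row counts of all preceding row groups, and that a missing first row number is reported as an error.

```rust
use std::collections::VecDeque;
use std::ops::Range;

/// Hypothetical stand-in for the per-row-group metadata described above.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupInfo {
    num_rows: i64,
    /// `None` models the case where the row number is unknown.
    first_row_number: Option<i64>,
}

/// Assigns each row group the file-level index of its first row,
/// i.e. the running sum of the row counts of all preceding row groups.
fn with_first_row_numbers(row_counts: &[i64]) -> Vec<RowGroupInfo> {
    let mut next = 0i64;
    row_counts
        .iter()
        .map(|&num_rows| {
            let info = RowGroupInfo { num_rows, first_row_number: Some(next) };
            next += num_rows;
            info
        })
        .collect()
}

/// Toy reader that produces row numbers for the selected row groups
/// without looking at any page data.
struct RowNumberSketch {
    /// Row numbers still to be produced, one half-open range per row group.
    ranges: VecDeque<Range<i64>>,
}

impl RowNumberSketch {
    /// Fails if any selected row group lacks a first row number.
    fn try_new(groups: &[RowGroupInfo]) -> Result<Self, String> {
        let mut ranges = VecDeque::new();
        for (i, group) in groups.iter().enumerate() {
            let first = group
                .first_row_number
                .ok_or_else(|| format!("row group {} has no first row number", i))?;
            ranges.push_back(first..first + group.num_rows);
        }
        Ok(Self { ranges })
    }

    /// Consumes up to `n` rows, appending their numbers to `out` if given.
    /// Returns how many rows were actually consumed.
    fn advance(&mut self, mut n: usize, mut out: Option<&mut Vec<i64>>) -> usize {
        let mut consumed = 0;
        while n > 0 {
            let front = match self.ranges.front_mut() {
                Some(range) => range,
                None => break, // all row groups exhausted
            };
            if front.start >= front.end {
                self.ranges.pop_front(); // move on to the next row group
                continue;
            }
            let take = ((front.end - front.start) as usize).min(n);
            let stop = front.start + take as i64;
            if let Some(buf) = out.as_deref_mut() {
                buf.extend(front.start..stop);
            }
            front.start = stop;
            n -= take;
            consumed += take;
        }
        consumed
    }

    /// Returns the row numbers of the next `n` rows (fewer at end of input).
    fn read(&mut self, n: usize) -> Vec<i64> {
        let mut out = Vec::new();
        self.advance(n, Some(&mut out));
        out
    }

    /// Skips the next `n` rows and returns how many were skipped.
    fn skip(&mut self, n: usize) -> usize {
        self.advance(n, None)
    }
}
```

With row counts `[3, 2, 4]` the first row numbers come out as 0, 3, and 5. Building a `RowNumberSketch` over only the first and third row groups then yields row numbers that jump from 2 to 5, which is why the numbers have to come from metadata rather than from counting the rows that were actually read.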