jecsand838 opened a new pull request, #9162:
URL: https://github.com/apache/arrow-rs/pull/9162
# Which issue does this PR close?
- Closes #8923.
# Rationale for this change
The `arrow-avro` crate's `ReaderBuilder` previously lacked the ability to
project (select) specific columns when reading Avro files. This is a common
feature in other Arrow readers (like `arrow-csv` and `arrow-ipc`) that enables
users to read only the columns they need, improving performance and reducing
memory usage.
# What changes are included in this PR?
- Added a `with_projection(projection: Vec<usize>)` method to
  `ReaderBuilder` that accepts zero-based column indices
- Implemented `AvroSchema::project()` method to create a projected Avro
  schema with only the selected fields
- The projection supports:
  - Selecting a subset of fields
  - Reordering fields
  - Preserving all record and field metadata (namespace, doc, defaults,
    aliases, etc.)
  - Preserving nested/complex types (records, arrays, maps, unions)
- Added validation for out-of-bounds indices and duplicate indices
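
The selection, reordering, and validation rules listed above can be illustrated with a small standalone sketch. This is not the code in this PR: `project_fields` and `ProjectionError` are hypothetical names, and the sketch works on a plain slice rather than an Avro schema. It only assumes that projection indices are zero-based, that output order follows the order of the indices, and that out-of-bounds and duplicate indices are rejected.

```rust
use std::collections::HashSet;

/// Hypothetical error type for this sketch; not the error type used by `arrow-avro`.
#[derive(Debug, PartialEq)]
enum ProjectionError {
    /// An index pointed past the end of the field list.
    OutOfBounds { index: usize, len: usize },
    /// The same index appeared more than once in the projection.
    Duplicate(usize),
}

/// Select and reorder `fields` according to zero-based `projection` indices.
/// The output follows the order of `projection`, so `[2, 0]` yields the third
/// field followed by the first. An empty projection yields an empty result.
fn project_fields<T: Clone>(
    fields: &[T],
    projection: &[usize],
) -> Result<Vec<T>, ProjectionError> {
    let mut seen = HashSet::with_capacity(projection.len());
    let mut projected = Vec::with_capacity(projection.len());
    for &index in projection {
        // Reject indices that do not refer to an existing field.
        let field = fields.get(index).ok_or(ProjectionError::OutOfBounds {
            index,
            len: fields.len(),
        })?;
        // `insert` returns false when the index was already present.
        if !seen.insert(index) {
            return Err(ProjectionError::Duplicate(index));
        }
        projected.push(field.clone());
    }
    Ok(projected)
}
```

Under these assumptions, calling `project_fields(&["id", "name", "email"], &[2, 0])` returns the `email` and `id` entries in that order, while `&[1, 1]` or an index of `5` produces an error instead of a silently truncated result.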
# Are these changes tested?
Yes, comprehensive tests have been added:
- Unit tests for `AvroSchema::project()` covering:
  - Empty projections
  - Single and multiple field selection
  - Field reordering
  - Metadata preservation (record-level and field-level)
  - Nested records and complex types (arrays, maps, unions)
  - Error cases (invalid JSON, non-record schemas, out-of-bounds indices,
    duplicate indices)
- Integration tests in the reader module for end-to-end projection with OCF
  files
# Are there any user-facing changes?
Yes, this adds a new public API method:
```rust
impl ReaderBuilder {
    /// Set a projection of columns to read (zero-based column indices).
    pub fn with_projection(self, projection: Vec<usize>) -> Self
}
```
This is consistent with the `with_projection` API of `ReaderBuilder` in
`arrow-csv` and `FileReaderBuilder` in `arrow-ipc`. There are no breaking
changes to existing APIs.