jecsand838 opened a new pull request, #8047: URL: https://github.com/apache/arrow-rs/pull/8047
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/issues/4886

# Rationale for this change

This change introduces the foundation in `codec.rs` for supporting Avro schema evolution, a key feature of the Avro specification. It enables reading Avro data when the writer's schema and the reader's schema do not match exactly but are compatible according to Avro's resolution rules. This makes data consumption more robust and flexible.

This approach focuses on "annotating" each `AvroDataType` with optional `ResolutionInfo` and then building the `Codec` using the `reader_schema`. This `ResolutionInfo` will be used downstream in my next PR by the `RecordDecoder` to efficiently read and decode the raw record bytes into the `reader_schema`.

Once this is merged in, promotion schema resolution support will need to be added to the `RecordDecoder` in a follow-up PR. These `RecordDecoder` updates will resemble this:

```rust
Promotion::IntToLong => Int32ToInt64(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::IntToFloat => Int32ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::IntToDouble => Int32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::LongToFloat => Int64ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::LongToDouble => Int64ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
Promotion::FloatToDouble => {
    Float32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY))
}
Promotion::BytesToString => BytesToString(
    OffsetBufferBuilder::new(DEFAULT_CAPACITY),
    BufferBuilder::new(DEFAULT_CAPACITY),
),
Promotion::StringToBytes => StringToBytes(
    OffsetBufferBuilder::new(DEFAULT_CAPACITY),
    BufferBuilder::new(DEFAULT_CAPACITY),
),
```

# What changes are included in this PR?

- **Schema Resolution Logic**: The core of this PR is the new schema resolution logic, which is encapsulated in the `Maker` struct. This handles:
  - **Type Promotions**: E.g., promoting `int` to `long` or `string` to `bytes`.
  - **Default Values**: Using default values from the reader's schema when a field is missing in the writer's schema.
  - **Record Evolution**: Resolving differences in record fields between the writer and reader schemas. This includes adding or removing fields.
  - **Enum Evolution**: Mapping enum symbols between the writer's and reader's schemas.
- **New Data Structures**: Several new data structures have been added to support schema resolution:
  - `ResolutionInfo`: An enum that captures the necessary information for resolving schema differences.
  - `ResolvedRecord`: A struct that holds the mapping between writer and reader record fields.
  - `AvroLiteral`: Represents Avro default values.
  - `Promotion`: An enum for different kinds of type promotions.
  - `EnumMapping`: A struct for enum symbol mapping.
- **Updated `AvroFieldBuilder`**: The `AvroFieldBuilder` has been updated to accept both a writer's and an optional reader's schema to facilitate schema resolution.
- **`PartialEq` Derivations**: `PartialEq` has been derived for several structs to simplify testing.
- **Refactoring**: The schema parsing logic has been refactored from a standalone function into the new `Maker` struct for better organization.

# Are these changes tested?

Yes, new unit tests have been added to verify the schema resolution logic, including tests for type promotions and handling of default values.

# Are there any user-facing changes?

N/A

# Follow-up PRs

- Promotion Schema Resolution support in `RecordDecoder`
- Default Value Schema resolution support (codec + decoder)
- Enum mapping Schema resolution support (codec + decoder)
- Skip Value Schema resolution support (codec + decoder)
- Record resolution support (codec + decoder)
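To make the type-promotion rules concrete, here is a minimal, self-contained sketch of how a writer/reader pair of primitive types might be classified. `Primitive`, `Resolved`, and `resolve` are hypothetical names used only for illustration and are not the types added in this PR. Only the `Promotion` variant names are taken from the `RecordDecoder` snippet in the rationale, and the set of allowed pairs follows the promotion rules in the Avro specification.

```rust
/// Hypothetical stand-in for Avro primitive types (not the crate's real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Primitive {
    Int,
    Long,
    Float,
    Double,
    Bytes,
    String,
}

/// Variant names mirror the `Promotion` arms quoted in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Promotion {
    IntToLong,
    IntToFloat,
    IntToDouble,
    LongToFloat,
    LongToDouble,
    FloatToDouble,
    BytesToString,
    StringToBytes,
}

/// Hypothetical outcome of resolving one writer primitive against one reader primitive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Resolved {
    /// Writer and reader types are identical; no annotation is needed.
    Same,
    /// The writer type can be promoted to the reader type.
    Promote(Promotion),
    /// The Avro promotion rules do not allow this pair.
    Incompatible,
}

/// Classify a (writer, reader) pair according to the Avro promotion rules.
fn resolve(writer: Primitive, reader: Primitive) -> Resolved {
    use Primitive as P;
    if writer == reader {
        return Resolved::Same;
    }
    let promotion = match (writer, reader) {
        (P::Int, P::Long) => Promotion::IntToLong,
        (P::Int, P::Float) => Promotion::IntToFloat,
        (P::Int, P::Double) => Promotion::IntToDouble,
        (P::Long, P::Float) => Promotion::LongToFloat,
        (P::Long, P::Double) => Promotion::LongToDouble,
        (P::Float, P::Double) => Promotion::FloatToDouble,
        (P::Bytes, P::String) => Promotion::BytesToString,
        (P::String, P::Bytes) => Promotion::StringToBytes,
        // Promotions only widen: e.g. long -> int or double -> float is rejected.
        _ => return Resolved::Incompatible,
    };
    Resolved::Promote(promotion)
}

fn main() {
    // Writer wrote `int`, reader expects `long`: a promotion is recorded.
    println!("{:?}", resolve(Primitive::Int, Primitive::Long));
    // Narrowing is not a valid promotion.
    println!("{:?}", resolve(Primitive::Double, Primitive::Float));
}
```

In this sketch the result of `resolve` plays the role of the per-type annotation: a decoder could pick a widening builder when it sees `Promote(..)` and fall back to the plain builder for `Same`.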