[PR] Added arrow-avro schema resolution foundations and type promotion in codec.rs [arrow-rs]

via GitHub Tue, 05 Aug 2025 13:23:12 -0700


jecsand838 opened a new pull request, #8047:
URL: https://github.com/apache/arrow-rs/pull/8047


   # Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/issues/4886
   
   # Rationale for this change
   
   This change introduces the foundation in `codec.rs` for supporting Avro 
schema evolution, a key feature of the Avro specification. It enables reading 
Avro data when the writer's schema and the reader's schema do not match exactly 
but are compatible according to Avro's resolution rules. This makes data 
consumption more robust and flexible.
   
   This approach focuses on "annotating" each `AvroDataType` with optional 
`ResolutionInfo` and then building the `Codec` using the `reader_schema`. This 
`ResolutionInfo` will be used downstream in my next PR by the `RecordDecoder` 
to efficiently read and decode the raw record bytes into the `reader_schema`.
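
As a rough illustration of the annotation idea, the sketch below resolves a writer/reader pair of primitive types into a reader-side type carrying optional resolution info. The `Promotion` variant names come from the snippet in this description, and the accepted pairs follow the promotions listed in the Avro specification. Everything else (`Primitive`, `AnnotatedType`, `resolve_primitive`, and the single-variant `ResolutionInfo`) is a simplified, hypothetical stand-in and not the PR's actual types.

```rust
/// Simplified set of Avro primitives (hypothetical; not the PR's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Primitive { Int, Long, Float, Double, Bytes, String }

/// Variant names follow the `Promotion` arms quoted in this description.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Promotion {
    IntToLong, IntToFloat, IntToDouble,
    LongToFloat, LongToDouble, FloatToDouble,
    BytesToString, StringToBytes,
}

/// Reduced stand-in for `ResolutionInfo`; only promotions are modeled here.
#[derive(Debug, Clone, PartialEq)]
enum ResolutionInfo { Promotion(Promotion) }

/// Reduced stand-in for `AvroDataType`: the reader's type plus an annotation.
#[derive(Debug, Clone, PartialEq)]
struct AnnotatedType {
    reader_type: Primitive,
    resolution: Option<ResolutionInfo>,
}

/// Resolves a writer primitive against a reader primitive.
/// Identical types need no annotation; incompatible pairs are an error.
fn resolve_primitive(writer: Primitive, reader: Primitive) -> Result<AnnotatedType, String> {
    let promotion = match (writer, reader) {
        (w, r) if w == r => None,
        (Primitive::Int, Primitive::Long) => Some(Promotion::IntToLong),
        (Primitive::Int, Primitive::Float) => Some(Promotion::IntToFloat),
        (Primitive::Int, Primitive::Double) => Some(Promotion::IntToDouble),
        (Primitive::Long, Primitive::Float) => Some(Promotion::LongToFloat),
        (Primitive::Long, Primitive::Double) => Some(Promotion::LongToDouble),
        (Primitive::Float, Primitive::Double) => Some(Promotion::FloatToDouble),
        (Primitive::Bytes, Primitive::String) => Some(Promotion::BytesToString),
        (Primitive::String, Primitive::Bytes) => Some(Promotion::StringToBytes),
        (w, r) => return Err(format!("cannot resolve writer {:?} to reader {:?}", w, r)),
    };
    Ok(AnnotatedType {
        reader_type: reader,
        resolution: promotion.map(ResolutionInfo::Promotion),
    })
}
```

Computing the annotation once at schema-resolution time means a decoder can choose its read path from the annotation instead of comparing schemas per record.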
   
   Once this is merged in, promotion schema resolution support will need to be 
added to the `RecordDecoder` in a follow-up PR. These `RecordDecoder` updates 
will resemble this:
   
   ```rust
   Promotion::IntToLong => Int32ToInt64(BufferBuilder::new(DEFAULT_CAPACITY)),
   Promotion::IntToFloat => Int32ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
   Promotion::IntToDouble => Int32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
   Promotion::LongToFloat => Int64ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
   Promotion::LongToDouble => Int64ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
   Promotion::FloatToDouble => {
       Float32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY))
   }
   Promotion::BytesToString => BytesToString(
       OffsetBufferBuilder::new(DEFAULT_CAPACITY),
       BufferBuilder::new(DEFAULT_CAPACITY),
   ),
   Promotion::StringToBytes => StringToBytes(
       OffsetBufferBuilder::new(DEFAULT_CAPACITY),
       BufferBuilder::new(DEFAULT_CAPACITY),
   ),
   ```
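
To make the intent of an arm such as `Promotion::IntToLong` concrete, here is a minimal sketch of what a promoting decoder might do for one value: read the writer's `int` (Avro encodes `int` and `long` as zig-zag variable-length integers) and append it to the reader's 64-bit column. The function names are hypothetical, and a plain `Vec<i64>` stands in for the `BufferBuilder` used above; this is not the follow-up PR's implementation.

```rust
/// Reads one Avro zig-zag varint from the front of `buf`.
/// Returns the decoded value and the number of bytes consumed,
/// or `None` if the input ends before the varint terminates.
fn read_zigzag_varint(buf: &[u8]) -> Option<(i64, usize)> {
    let mut raw: u64 = 0;
    // A 64-bit varint occupies at most 10 bytes.
    for (i, &byte) in buf.iter().enumerate().take(10) {
        raw |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            // Undo zig-zag encoding: (n >> 1) ^ -(n & 1).
            let value = ((raw >> 1) as i64) ^ -((raw & 1) as i64);
            return Some((value, i + 1));
        }
    }
    None
}

/// Decodes a writer `int` and stores it in a reader `long` column.
/// `out` is a stand-in for an Arrow buffer builder (assumption).
fn decode_int_as_long(buf: &[u8], out: &mut Vec<i64>) -> Option<usize> {
    let (value, used) = read_zigzag_varint(buf)?;
    // The writer declared `int`, so reject values outside the i32 range.
    if value < i64::from(i32::MIN) || value > i64::from(i32::MAX) {
        return None;
    }
    // Widening i32 to i64 is lossless, so the value is stored as-is.
    out.push(value);
    Some(used)
}
```

The other numeric promotions would follow the same shape, differing only in the conversion applied before the value is appended.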
   
   # What changes are included in this PR?
   
   - **Schema Resolution Logic**: The core of this PR is the new schema 
resolution logic, which is encapsulated in the `Maker` struct. This handles:
       - **Type Promotions**: E.g., promoting `int` to `long` or `string` to 
`bytes`.
       - **Default Values**: Using default values from the reader's schema when 
a field is missing in the writer's schema.
       - **Record Evolution**: Resolving differences in record fields between 
the writer and reader schemas. This includes adding or removing fields.
       - **Enum Evolution**: Mapping enum symbols between the writer's and 
reader's schemas.
   - **New Data Structures**: Several new data structures have been added to 
support schema resolution:
       - `ResolutionInfo`: An enum that captures the necessary information for 
resolving schema differences.
       - `ResolvedRecord`: A struct that holds the mapping between writer and 
reader record fields.
       - `AvroLiteral`: Represents Avro default values.
       - `Promotion`: An enum for different kinds of type promotions.
       - `EnumMapping`: A struct for enum symbol mapping.
   - **Updated `AvroFieldBuilder`**: The `AvroFieldBuilder` has been updated to 
accept both a writer's and an optional reader's schema to facilitate schema 
resolution.
   - **`PartialEq` Derivations**: `PartialEq` has been derived for several 
structs to simplify testing.
   - **Refactoring**: The schema parsing logic has been refactored from a 
standalone function into the new `Maker` struct for better organization.
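
The record and enum rules listed above can be sketched as follows, based on the resolution rules in the Avro specification: a reader field absent from the writer falls back to the reader's default, a writer field absent from the reader is skipped, and a writer enum symbol unknown to the reader maps to the reader's default symbol if one exists. `FieldSource`, `resolve_fields`, and `map_enum_symbols` are hypothetical names, and defaults are modeled as plain strings for brevity; the PR's `ResolvedRecord`, `AvroLiteral`, and `EnumMapping` may be shaped differently.

```rust
/// Where the value for one reader field comes from (hypothetical type).
#[derive(Debug, PartialEq)]
enum FieldSource {
    /// Read from the writer's field at this index.
    Writer(usize),
    /// Missing in the writer's schema; use the reader's default value.
    Default(String),
}

/// Resolves reader fields (name, optional default) against writer field names.
/// Writer fields the reader does not mention are simply never referenced.
fn resolve_fields(
    writer: &[&str],
    reader: &[(&str, Option<&str>)],
) -> Result<Vec<FieldSource>, String> {
    reader
        .iter()
        .map(|(name, default)| {
            if let Some(idx) = writer.iter().position(|w| w == name) {
                Ok(FieldSource::Writer(idx))
            } else if let Some(d) = default {
                Ok(FieldSource::Default(d.to_string()))
            } else {
                Err(format!("reader field `{}` is missing from the writer and has no default", name))
            }
        })
        .collect()
}

/// Maps each writer enum symbol index to a reader symbol index.
/// Unknown symbols fall back to the reader's default symbol, if any.
fn map_enum_symbols(
    writer: &[&str],
    reader: &[&str],
    reader_default: Option<&str>,
) -> Result<Vec<usize>, String> {
    let default_idx = reader_default.and_then(|d| reader.iter().position(|r| *r == d));
    writer
        .iter()
        .map(|sym| {
            reader
                .iter()
                .position(|r| r == sym)
                .or(default_idx)
                .ok_or_else(|| format!("writer symbol `{}` is not in the reader's enum", sym))
        })
        .collect()
}
```

Both functions run once per schema pair, so the per-record decode path only needs to index into the precomputed mappings.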
   
   # Are these changes tested?
   
   Yes, new unit tests have been added to verify the schema resolution logic, 
including tests for type promotions and handling of default values.
   
   # Are there any user-facing changes?
   
   N/A
   
   # Follow-up PRs
   
   - Promotion schema resolution support in `RecordDecoder`
   - Default value schema resolution support (codec + decoder)
   - Enum mapping schema resolution support (codec + decoder)
   - Skip value schema resolution support (codec + decoder)
   - Record resolution support (codec + decoder)
   

