jecsand838 opened a new pull request, #7852:
URL: https://github.com/apache/arrow-rs/pull/7852

   # Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/issues/4886
   
   - Related to https://github.com/apache/arrow-rs/pull/6965
   
   # Rationale for this change
   
   The `arrow-avro` crate currently lacks support for the Avro `enum` type, 
which is a standard and commonly used type in Avro schemas. This omission 
prevents users from reading Avro files containing enums, limiting the crate's 
utility.
   
   This change introduces support for decoding Avro enums by mapping them to 
the Arrow `DictionaryArray` type. This is a natural and efficient 
representation: each symbol is stored once in the dictionary, and every row 
holds only an integer key. Implementing this feature brings the `arrow-avro` 
crate closer to full Avro specification compliance and makes it more robust 
for real-world use cases.
   
   # What changes are included in this PR?
   
   This PR introduces comprehensive support for Avro enum decoding along with a 
minor Avro decimal decoding fix. The key changes are:
   
   1.  **Schema Parsing (`codec.rs`):**
       *   A new `Codec::Enum(Arc<[String]>)` variant was added to represent a 
parsed enum and its associated symbols.
       *   The `make_data_type` function now parses `ComplexType::Enum` 
schemas. It also stores the original symbols as a JSON string in the `Field`'s 
metadata under the key `"avro.enum.symbols"` to ensure schema fidelity and 
enable lossless round-trip conversions.
       *   The `Codec::data_type` method was updated to map the internal 
`Codec::Enum` to the corresponding Arrow type, 
`DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8))`.
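
   As a rough illustration of the metadata step described above, the sketch 
below builds a metadata map holding the enum symbols as a JSON array under 
the `"avro.enum.symbols"` key. It is not the PR's code: the plain `HashMap` 
stands in for Arrow `Field` metadata, and the helper names `symbols_to_json` 
and `enum_metadata` are hypothetical. The hand-rolled serialization assumes 
symbols follow Avro naming rules (letters, digits, underscore), so no JSON 
escaping is performed.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Metadata key named in the PR description.
const AVRO_ENUM_SYMBOLS_KEY: &str = "avro.enum.symbols";

/// Serialize enum symbols as a JSON array of strings.
/// Assumption: symbols only contain [A-Za-z0-9_], so nothing needs escaping.
fn symbols_to_json(symbols: &[String]) -> String {
    let quoted: Vec<String> = symbols.iter().map(|s| format!("\"{}\"", s)).collect();
    format!("[{}]", quoted.join(","))
}

/// Build a metadata map (a stand-in for Arrow `Field` metadata) that records
/// the original symbols so the Avro schema can be reconstructed later.
fn enum_metadata(symbols: &Arc<[String]>) -> HashMap<String, String> {
    let mut metadata = HashMap::new();
    metadata.insert(AVRO_ENUM_SYMBOLS_KEY.to_string(), symbols_to_json(symbols));
    metadata
}
```

   Keeping the symbols in metadata means a later writer can recover the enum 
definition even though the Arrow type itself only says "dictionary of strings".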
   
   2.  **Decoding Logic (`reader/record.rs`):**
       *   A new `Decoder::Enum(Vec<i32>, Arc<[String]>)` variant was added to 
manage the state of decoding enum values.
       *   The `Decoder` was enhanced to create, decode, and flush `Enum` types:
           *   `try_new` creates the decoder.
           *   `decode` reads the Avro `int` index from the byte buffer.
           *   `flush` constructs the final `DictionaryArray<Int32Type>` using 
the collected indices as keys and the stored symbols as the dictionary values.
           *   `append_null` was extended to handle nullable enums.
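
   The decode and flush steps above can be sketched with the standard library 
alone. This is a simplified illustration, not the PR's implementation: 
`EnumDecoder` and `read_avro_int` are hypothetical names, the range check on 
the index is an assumption of this sketch, nullability is omitted, and `flush` 
returns the raw keys and symbols rather than building a 
`DictionaryArray<Int32Type>`. It relies on the Avro encoding rule that an enum 
value is written as an `int` (zig-zag, variable-length) giving the zero-based 
position of the symbol.

```rust
use std::sync::Arc;

/// Read one Avro `int` (variable-length, zig-zag encoded) from the front of
/// `buf`, advancing the slice. Returns `None` on truncated or over-long input.
fn read_avro_int(buf: &mut &[u8]) -> Option<i32> {
    let mut value: u32 = 0;
    for i in 0..5u32 {
        let (&byte, rest) = buf.split_first()?;
        *buf = rest;
        // Each byte contributes its low 7 bits, least significant group first.
        value |= u32::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            // Undo zig-zag encoding: (n >> 1) ^ -(n & 1).
            return Some(((value >> 1) as i32) ^ -((value & 1) as i32));
        }
    }
    None
}

/// Simplified stand-in for the `Decoder::Enum(Vec<i32>, Arc<[String]>)` state.
struct EnumDecoder {
    indices: Vec<i32>,
    symbols: Arc<[String]>,
}

impl EnumDecoder {
    fn new(symbols: Arc<[String]>) -> Self {
        Self { indices: Vec::new(), symbols }
    }

    /// Decode one enum value and remember its index as a dictionary key.
    /// The bounds check is an assumption of this sketch.
    fn decode(&mut self, buf: &mut &[u8]) -> Result<(), String> {
        let idx = read_avro_int(buf).ok_or("truncated or invalid Avro int")?;
        if idx < 0 || idx as usize >= self.symbols.len() {
            return Err(format!("enum index {} out of range", idx));
        }
        self.indices.push(idx);
        Ok(())
    }

    /// Hand back (keys, dictionary values) and reset the collected keys.
    fn flush(&mut self) -> (Vec<i32>, Arc<[String]>) {
        (std::mem::take(&mut self.indices), Arc::clone(&self.symbols))
    }
}
```

   Because the symbols live behind an `Arc`, each flush shares the same 
dictionary values instead of copying the strings for every batch.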
   
   3.  **Minor Decimal Type Decoding Fix (`codec.rs`):**
       *   A minor decimal decoding fix was made in `make_data_type`: the 
`(Some("decimal"), c @ Codec::Fixed(sz))` branch of `match 
(t.attributes.logical_type, &mut field.codec)` was unreachable. The new 
decimal integration tests in `arrow-avro/src/reader/mod.rs` caught this 
issue.
   
   # Are these changes tested?
   
   *   Yes, test coverage was provided for the new `Enum` type: 
       *   New unit tests were added to `record.rs` to specifically validate 
both non-nullable and nullable enum decoding logic.
       *   The existing integration test suite in 
`arrow-avro/src/reader/mod.rs` was used to validate the end-to-end 
functionality with a new `avro/simple_enum.avro` test case, ensuring 
compatibility with the overall reader infrastructure.
   *   New tests were also included for the `Decimal` and `Fixed` types:
       *   The same integration test suite was extended with tests for 
`avro/simple_fixed.avro`, `avro/fixed_length_decimal.avro`, 
`avro/fixed_length_decimal_legacy.avro`, `avro/int32_decimal.avro`, and 
`avro/int64_decimal.avro`.
   
   # Are there any user-facing changes?
   
   N/A
   

