jecsand838 opened a new pull request, #8220:
URL: https://github.com/apache/arrow-rs/pull/8220

   # Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/issues/4886
   - Follows up on https://github.com/apache/arrow-rs/pull/8047
   
   # Rationale for this change
   
   When reading Avro into Arrow with a projection or a reader schema that omits 
some writer fields, we were still decoding those writer‑only fields 
item‑by‑item. This is unnecessary work and can dominate CPU time for large 
arrays/maps or deeply nested records.
   
   Avro’s binary format explicitly allows fast skipping for arrays/maps by 
encoding data in blocks: when a block's count is negative, its absolute value 
is the item count and the next `long` gives the block's size in bytes, so a 
reader can skip the whole block with a single offset advance instead of 
decoding each item. This PR teaches the record reader to recognize and 
leverage that, and to avoid constructing full decoders for fields we will skip.
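
As a rough illustration of the block-skipping rule described above, here is a minimal, standalone sketch that skips an Avro array of `long` items. The names `read_long` and `skip_long_array` are hypothetical and are not the helpers in `arrow-avro`; the sketch assumes the whole array is already in one byte slice.

```rust
/// Decode one Avro `long` (zigzag-encoded base-128 varint).
/// Returns the value and the number of bytes consumed.
fn read_long(buf: &[u8]) -> Option<(i64, usize)> {
    let mut value: u64 = 0;
    // An Avro long occupies at most 10 bytes.
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            let decoded = ((value >> 1) as i64) ^ -((value & 1) as i64);
            return Some((decoded, i + 1));
        }
    }
    None // truncated or over-long varint
}

/// Skip an Avro array of `long` items that starts at `buf[0]`.
/// Returns how many bytes the encoded array occupies.
fn skip_long_array(buf: &[u8]) -> Option<usize> {
    let mut pos = 0;
    loop {
        let (count, n) = read_long(buf.get(pos..)?)?;
        pos += n;
        if count == 0 {
            return Some(pos); // a zero count terminates the array
        }
        if count < 0 {
            // Sized block: the next long is the byte length, so jump over it.
            let (size, n) = read_long(buf.get(pos..)?)?;
            pos += n;
            pos = pos.checked_add(usize::try_from(size).ok()?)?;
            if pos > buf.len() {
                return None;
            }
        } else {
            // Unsized block: each item still has to be walked to find its end.
            for _ in 0..count {
                let (_, n) = read_long(buf.get(pos..)?)?;
                pos += n;
            }
        }
    }
}
```

For a sized block the cost is independent of the item count, while the unsized branch still has to scan every item; that difference is what makes the negative-count path worth special-casing.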
   
   # What changes are included in this PR?
   
   **Reader / decoding architecture**
   - **Abstracted skip logic**: Centralized the common logic previously 
duplicated across `skip_blocks_fast` and `read_blockwise_items` into reusable 
helpers. This reduces code duplication and clarifies the control flow for 
arrays/maps.
   - **Skip-aware record decoding**:
     - At construction time, we now precompute per-record **skip decoders** for 
writer fields that the reader will ignore.
     - Introduced a resolved-record path (`RecordResolved`) that carries:
       - `writer_to_reader` mapping for field alignment,
       - a prebuilt list of **skip decoders** for fields not present in the 
reader,
       - the set of active per-field decoders for the projected fields.
   - **Codec builder enhancements**: In `arrow-avro/src/codec.rs`, record 
construction now:
     - Builds Arrow `Field`s and their decoders only for fields that are read,
     - Builds `skip_decoders` (via `build_skip_decoders`) for fields to ignore.
   - **Error handling and consistency**: Kept existing strict-mode behavior; 
improved internal branching to avoid inconsistent states during partial decodes.
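
The resolved-record idea above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the `RecordResolved` implementation: `FieldAction`, `resolve`, `read_long_at`, and `decode_row` are hypothetical names, fields are matched by name only, and every field is assumed to be an Avro `string` (a `long` length followed by that many UTF-8 bytes).

```rust
/// What to do with one writer field, in writer order (hypothetical type).
#[derive(Debug, PartialEq)]
enum FieldAction {
    /// Decode into the reader column at this index.
    Read(usize),
    /// Not in the reader schema: advance past the encoded bytes only.
    Skip,
}

/// Built once per record schema, then reused for every row.
fn resolve(writer: &[&str], reader: &[&str]) -> Vec<FieldAction> {
    writer
        .iter()
        .map(|w| match reader.iter().position(|r| r == w) {
            Some(i) => FieldAction::Read(i),
            None => FieldAction::Skip,
        })
        .collect()
}

/// Decode one Avro `long` at `*pos` and advance `*pos` past it.
fn read_long_at(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let mut value: u64 = 0;
    for i in 0..10 {
        let byte = *buf.get(*pos + i)?;
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            *pos += i + 1;
            return Some(((value >> 1) as i64) ^ -((value & 1) as i64));
        }
    }
    None
}

/// Decode one row of string fields, returning the number of bytes consumed.
fn decode_row(
    plan: &[FieldAction],
    buf: &[u8],
    columns: &mut [Vec<String>],
) -> Option<usize> {
    let mut pos = 0;
    for action in plan {
        let len = usize::try_from(read_long_at(buf, &mut pos)?).ok()?;
        let end = pos.checked_add(len)?;
        let bytes = buf.get(pos..end)?; // bounds check for both paths
        if let FieldAction::Read(i) = action {
            // Only projected fields pay for UTF-8 validation and allocation.
            let text = std::str::from_utf8(bytes).ok()?;
            columns.get_mut(*i)?.push(text.to_owned());
        }
        pos = end;
    }
    Some(pos)
}
```

Computing the plan at construction time keeps the per-row loop free of name lookups, which is the same motivation the PR gives for precomputing skip decoders.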
   
   **Tests**
   - **Unit tests (in `arrow-avro/src/reader/record.rs`)**
     - Added focused tests that exercise the new skip logic:
       - Skipping writer‑only fields inside **arrays** and **maps** (including 
negative‑count block skipping and mixed multi‑block payloads).
       - Skipping nested structures within records to ensure offsets and 
lengths remain correct for the fields that are read.
       - Nullability and union handling when adjacent fields are skipped, to 
ensure both remain correct.
   - **Integration tests (in `arrow-avro/src/reader/mod.rs`)**
     - Added an end‑to‑end test using `avro/alltypes_plain.avro` to validate 
that projecting a subset of fields (reader schema omits some writer fields):
       - Produces the correct Arrow arrays for the selected fields, and
       - Avoids decoding skipped fields (validated indirectly via behavior and 
block boundaries).
     - The test covers compressed and uncompressed variants already present in 
the suite to ensure behavior is consistent across codecs.
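
To make "mixed multi‑block payloads" concrete, the following sketch shows one way such a test input could be built by hand. It is an assumption about how a payload might be constructed, not the code used in this PR's tests; `write_long` and `encode_mixed_blocks` are hypothetical helper names.

```rust
/// Append one Avro `long` (zigzag + base-128 varint) to `out`.
fn write_long(out: &mut Vec<u8>, value: i64) {
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    while z >= 0x80 {
        out.push((z as u8 & 0x7f) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Encode each chunk as one block of an Avro array of longs.
/// Even-indexed chunks become sized blocks (negative count + byte length);
/// odd-indexed chunks become plain blocks (positive count only).
fn encode_mixed_blocks(chunks: &[&[i64]]) -> Vec<u8> {
    let mut out = Vec::new();
    for (i, chunk) in chunks.iter().enumerate() {
        if chunk.is_empty() {
            continue; // a zero count would terminate the array early
        }
        let mut body = Vec::new();
        for &item in chunk.iter() {
            write_long(&mut body, item);
        }
        if i % 2 == 0 {
            write_long(&mut out, -(chunk.len() as i64));
            write_long(&mut out, body.len() as i64);
        } else {
            write_long(&mut out, chunk.len() as i64);
        }
        out.extend_from_slice(&body);
    }
    write_long(&mut out, 0); // end-of-array marker
    out
}
```

A payload built this way exercises both branches of a block skipper in a single array, since the reader has to jump over the sized block and walk the plain one.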
   
   # Are these changes tested?
   
   - **New unit tests** cover:
     - Fast skipping for arrays/maps using negative block counts and block 
sizes (per Avro spec).
     - Nested and nullable scenarios to ensure correct offsets, validity 
bitmaps, and flush behavior when adjacent fields are skipped.
   - **New integration test** in `reader/mod.rs`:
     - Reads `avro/alltypes_plain.avro` with a reader schema that omits several 
writer fields and asserts the resulting `RecordBatch` matches the expected 
arrays while exercising the skip path.
   - Existing promotion, enum, decimal, fixed, and union tests continue to 
pass, ensuring no regressions in unrelated areas.
   
   # Are there any user-facing changes?
   
   N/A since `arrow-avro` is not public yet.
   


