jecsand838 opened a new pull request, #8220:
URL: https://github.com/apache/arrow-rs/pull/8220
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/issues/4886
- Follows up on https://github.com/apache/arrow-rs/pull/8047

# Rationale for this change

When reading Avro into Arrow with a projection, or with a reader schema that omits some writer fields, we were still decoding those writer-only fields item by item. This is unnecessary work and can dominate CPU time for large arrays/maps or deeply nested records.

Avro's binary format explicitly allows fast skipping for arrays and maps by encoding data in blocks: when the block count is negative, the next `long` gives the byte size of the block, enabling O(1) skipping of that block without decoding each item. This PR teaches the record reader to recognize and leverage that, and to avoid constructing decoders for fields that will be skipped altogether.

# What changes are included in this PR?

**Reader / decoding architecture**

- **Abstracted skip logic**: Centralized the common logic previously duplicated across `skip_blocks_fast` and `read_blockwise_items` into reusable helpers. This reduces code duplication and clarifies the control flow for arrays/maps.
- **Skip-aware record decoding**:
  - At construction time, we now precompute per-record **skip decoders** for writer fields that the reader will ignore.
  - Introduced a resolved-record path (`RecordResolved`) that carries:
    - a `writer_to_reader` mapping for field alignment,
    - a prebuilt list of **skip decoders** for fields not present in the reader,
    - the set of active per-field decoders for the projected fields.
- **Codec builder enhancements**: In `arrow-avro/src/codec.rs`, record construction now:
  - builds Arrow `Field`s and their decoders only for fields that are read,
  - builds `skip_decoders` (via `build_skip_decoders`) for fields to ignore.
- **Error handling and consistency**: Kept existing strict-mode behavior; improved internal branching to avoid inconsistent states during partial decodes.
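The block-skipping rule described above can be sketched in standalone Rust. This is an illustrative sketch, not the actual arrow-avro code: the helper names (`read_varint`, `read_long`, `skip_blocks`) and the byte-slice cursor are assumptions made for the example.

```rust
/// Read an unsigned LEB128-style varint, as used by Avro's binary encoding.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value: u64 = 0;
    let mut shift = 0;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed: too many continuation bytes
        }
    }
}

/// Decode an Avro `long` (zigzag-encoded varint).
fn read_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let raw = read_varint(buf, pos)?;
    Some(((raw >> 1) as i64) ^ -((raw & 1) as i64))
}

/// Skip an Avro array/map without materializing it.
/// Blocks with a negative count carry a byte size, letting us jump in O(1);
/// positive-count blocks must fall back to skipping item by item.
fn skip_blocks(
    buf: &[u8],
    pos: &mut usize,
    skip_item: impl Fn(&[u8], &mut usize) -> Option<()>,
) -> Option<()> {
    loop {
        let count = read_long(buf, pos)?;
        if count == 0 {
            return Some(()); // zero count terminates the array/map
        } else if count < 0 {
            // Negative count: the next long is the block's size in bytes.
            let size = read_long(buf, pos)?;
            *pos += usize::try_from(size).ok()?;
        } else {
            for _ in 0..count {
                skip_item(buf, pos)?;
            }
        }
    }
}
```

For example, an array of two longs written as a single negative-count block (`count = -2`, `size = 2`, items `1` and `2`, then the `0` terminator) zigzag-encodes to the bytes `[3, 4, 2, 4, 0]`, and `skip_blocks` jumps over the item bytes without decoding them.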
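The resolved-record idea (align writer fields to reader fields once, at construction time, and route unmatched fields to skippers) can be illustrated with a minimal sketch. The names here (`ResolvedRecord`, `FieldAction`, `resolve`) are hypothetical and stand in for the PR's `RecordResolved` path and `writer_to_reader` mapping; the real code builds actual skip decoders rather than a marker enum.

```rust
/// What to do with each writer field during decoding (illustrative only).
#[derive(Debug, PartialEq)]
enum FieldAction {
    /// Decode into the reader column at this index.
    Read(usize),
    /// Consume the field's bytes without building an array.
    Skip,
}

/// Hypothetical resolved record: one action per writer field, in writer order.
struct ResolvedRecord {
    writer_to_reader: Vec<FieldAction>,
}

impl ResolvedRecord {
    /// Align writer fields against reader fields by name; writer fields
    /// absent from the reader schema get a Skip action, so no value
    /// decoder is ever constructed for them.
    fn resolve(writer_fields: &[&str], reader_fields: &[&str]) -> Self {
        let actions = writer_fields
            .iter()
            .map(|w| {
                reader_fields
                    .iter()
                    .position(|r| r == w)
                    .map_or(FieldAction::Skip, FieldAction::Read)
            })
            .collect();
        ResolvedRecord { writer_to_reader: actions }
    }
}
```

Doing this resolution once up front means the per-row hot loop is a plain iteration over precomputed actions, with no per-row name lookups or branching on schema shape.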
**Tests**

- **Unit tests (in `arrow-avro/src/reader/record.rs`)**: Added focused tests that exercise the new skip logic:
  - Skipping writer-only fields inside **arrays** and **maps** (including negative-count block skipping and mixed multi-block payloads).
  - Skipping nested structures within records, to ensure offsets and lengths remain correct for the fields that are read.
  - Ensured nullability and union handling remain correct when adjacent fields are skipped.
- **Integration tests (in `arrow-avro/src/reader/mod.rs`)**:
  - Added an end-to-end test using `avro/alltypes_plain.avro` to validate that projecting a subset of fields (a reader schema that omits some writer fields) both:
    - produces the correct Arrow arrays for the selected fields, and
    - avoids decoding the skipped fields (validated indirectly via behavior and block boundaries).
  - The test covers the compressed and uncompressed variants already present in the suite, ensuring behavior is consistent across codecs.

# Are these changes tested?

- **New unit tests** cover:
  - Fast skipping for arrays/maps using negative block counts and block sizes (per the Avro spec).
  - Nested and nullable scenarios, verifying correct offsets, validity bitmaps, and flush behavior when adjacent fields are skipped.
- **New integration test** in `reader/mod.rs`:
  - Reads `avro/alltypes_plain.avro` with a reader schema that omits several writer fields and asserts the resulting `RecordBatch` matches the expected arrays while exercising the skip path.
- Existing promotion, enum, decimal, fixed, and union tests continue to pass, ensuring no regressions in unrelated areas.

# Are there any user-facing changes?

N/A, since `arrow-avro` is not yet public.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org