[I] [arrow-avro] Enhance benchmark suite [arrow-rs]

via GitHub Thu, 23 Oct 2025 15:28:27 -0700


jecsand838 opened a new issue, #8704:
URL: https://github.com/apache/arrow-rs/issues/8704


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   The `arrow-avro` benchmark suite covers only a subset of the supported 
data types and does not exercise the streaming writer.
   
   Concretely:
   
   - The crate’s **supported decode sites** are listed by 
`arrow_avro::reader::record::Decoder` (e.g., `TimestampNanos`, Arrow 
`Duration(*)`, `RunEndEncoded`, `Union`, `Decimal32/64/256`). Not all of 
these are covered by the current `benches/decoder.rs`.
   - The crate’s **supported encode sites** are listed by 
`arrow_avro::writer::encoder::Encoder` (e.g., `Date32`, `Time32SecsToMillis`, 
`Time64Micros`, `Utf8View/BinaryView`, `ListView/LargeListView`, 
`FixedSizeList`, `RunEncoded16/32/64`, `Utf8Large/LargeBinary`, 
`TimestampMillis/Nanos`). Not all of these are covered by the current 
`benches/avro_writer.rs`.
   - There is **no benchmark for `AvroStreamWriter`** even though the writer 
module exposes it for SOE streams (a common production path for registry‑based 
pipelines).
   
   Additionally, `benches/decoder.rs` still relies on the external 
**`apache-avro`** crate to generate Avro bytes for the reader benchmarks. Now 
that `arrow-avro` has its own `AvroWriter`, we can remove this extra dependency 
entirely by using the in‑crate writer to generate input payloads for decode 
benches.
   
   **Describe the solution you'd like**
   
   1. **Extend benchmark coverage in `benches/decoder.rs` and 
`benches/avro_writer.rs`.** Based on the current `Decoder` / `Encoder` 
variants, add benches for the following **types that are missing 
coverage**:
       - **In `decoder.rs`, add benches for:**
         - `Null`
         - `TimestampNanos`
         - Arrow **`Duration`** units: `DurationSecond`, `DurationMillisecond`, 
`DurationMicrosecond`, `DurationNanosecond`
         - `Decimal32`, `Decimal64`, `Decimal256` (note: `Decimal128` is 
already exercised)
         - `RunEndEncoded`
         - `Union` (dense)
         - All schema resolution paths, e.g., `Int32ToFloat64`
       - **In `avro_writer.rs`, add benches for:**
         - `Null`
         - **Date/Time/Timestamp:** `Date32`, `Time32SecsToMillis`, 
`Time32Millis`, `Time64Micros`, `TimestampMillis`, `TimestampNanos`
         - **View & large‑offset types:** `Utf8View`, `BinaryView`, 
`Utf8Large`, `LargeBinary`
         - **List variants:** `ListView`, `LargeListView`, `FixedSizeList`, 
`LargeList`
         - **Run‑end encoded:** `RunEncoded16`, `RunEncoded32`, `RunEncoded64`
         - **Intervals & Durations:** `IntervalYearMonth`, `IntervalDayTime`, 
`IntervalMonthDayNano` (if not already present), plus Arrow 
`DurationSeconds/Millis/Micros/Nanos`
         - `Union` (dense)
   2. **Add a new `benches/stream.rs` for `AvroStreamWriter`.** Create a new 
benchmark file that emits Avro **Single‑Object Encoding (SOE)** records for 
each supported type using `AvroStreamWriter`, and then decodes them with the 
`Decoder` to measure end‑to‑end streaming performance. This is a common 
production path when using a schema registry, and the writer docs explicitly 
call out `AvroStreamWriter`.
       - Cover all the types listed above (mirroring the static writer 
benches), including nested (`Struct`, `Map`, `List*`, `Union`) and 
fixed/UUID/decimal/interval/duration sites.
       - Include a small “mixed record” schema (several fields across 
categories) to reflect real‑world row composition.
   3. **Remove `apache-avro` from the benches and rely on the in‑crate 
writer.** Refactor `benches/decoder.rs` to **generate input bytes with 
`arrow_avro::writer::AvroWriter` (OCF)** instead of the external `apache-avro` 
crate, and drop that dev‑dependency from the crate entirely. This makes the 
benches self‑contained and tests precisely the Arrow‑native encode/decode paths.
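
   For readers less familiar with the SOE framing that item 2 refers to, the 
following is a minimal, std-only sketch of the layout described in the Avro 
specification: a two-byte marker (`C3 01`), the 8-byte little-endian 
CRC-64-AVRO fingerprint of the writer schema, and then the Avro binary body. 
It does not use `arrow-avro` at all. The helper names (`fingerprint64`, 
`encode_long`, `soe_frame`) are hypothetical, and the schema string is assumed 
to already be in Parsing Canonical Form.

```rust
// Standalone sketch of Avro Single-Object Encoding (SOE) framing.
// Uses only std; helper names are hypothetical, not arrow-avro APIs.

/// Seed value for the CRC-64-AVRO (Rabin) fingerprint from the Avro spec.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// CRC-64-AVRO fingerprint of `bytes`. The lookup table is rebuilt on every
/// call to keep the sketch short; real code would cache it.
fn fingerprint64(bytes: &[u8]) -> u64 {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // XOR with EMPTY only when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    let mut fp = EMPTY;
    for &b in bytes {
        fp = (fp >> 8) ^ table[((fp ^ (b as u64)) & 0xff) as usize];
    }
    fp
}

/// Appends an Avro `long` to `out`: zig-zag, then base-128 varint.
fn encode_long(n: i64, out: &mut Vec<u8>) {
    let mut z = ((n << 1) ^ (n >> 63)) as u64;
    while z >= 0x80 {
        out.push(((z as u8) & 0x7f) | 0x80);
        z >>= 7;
    }
    out.push(z as u8);
}

/// Wraps an already-encoded Avro body in an SOE frame.
/// Assumes `canonical_schema` is already in Parsing Canonical Form.
fn soe_frame(canonical_schema: &str, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]); // SOE marker
    out.extend_from_slice(&fingerprint64(canonical_schema.as_bytes()).to_le_bytes());
    out.extend_from_slice(body);
    out
}

fn main() {
    // A single `long` value framed against the primitive schema "long".
    let mut body = Vec::new();
    encode_long(64, &mut body);
    let frame = soe_frame("\"long\"", &body);
    println!("{} bytes: {:02x?}", frame.len(), frame);
}
```

   Every SOE message carries a fixed 10-byte prefix on top of the encoded 
row, and the decoder must map the fingerprint back to a schema, so the 
streaming path has per-message costs that the OCF benches do not exercise.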
   
   **Describe alternatives you've considered**
   
   - **Status quo**: Keep partial coverage in two benches. This leaves gaps for 
correctness/perf regressions (e.g., nano‑precision timestamps, Arrow 
`Duration(*)`, REE/Union sites) and misses the streaming path entirely.
   - **Only unit tests**: Unit tests validate correctness but don’t capture 
micro‑perf characteristics or regressions across serialization choices (e.g., 
offset widths, view vs. non‑view, REE). Benchmarks give actionable signals over 
time.
   
   **Additional context**
   
   N/A
   


