jecsand838 opened a new issue, #8704:
URL: https://github.com/apache/arrow-rs/issues/8704
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
The `arrow-avro` benchmark suite covers only a subset of the supported
data types and does not exercise the streaming writer.
Concretely:
- The crate’s **supported decode sites** are listed by
`arrow_avro::reader::record::Decoder` (e.g., `TimestampNanos`, Arrow
`Duration(*)`, `RunEndEncoded`, `Union`, `Decimal32/64/256`). Not all of
these are covered by the current `benches/decoder.rs`.
- The crate’s **supported encode sites** are listed by
`arrow_avro::writer::encoder::Encoder` (e.g., `Date32`, `Time32SecsToMillis`,
`Time64Micros`, `Utf8View/BinaryView`, `ListView/LargeListView`,
`FixedSizeList`, `RunEncoded16/32/64`, `Utf8Large/LargeBinary`,
`TimestampMillis/Nanos`). Not all of these are covered by the current
`benches/avro_writer.rs`.
- There is **no benchmark for `AvroStreamWriter`**, even though the writer
module exposes it for Single‑Object Encoding (SOE) streams (a common
production path for registry‑based pipelines).
Additionally, `benches/decoder.rs` still relies on the external
**`apache-avro`** crate to generate Avro bytes for the reader benchmarks. Now
that `arrow-avro` has its own `AvroWriter`, we can remove this extra dependency
entirely by using the in‑crate writer to generate input payloads for decode
benches.
**Describe the solution you'd like**
1. **Extend benchmark coverage in `benches/decoder.rs` and
`benches/avro_writer.rs`.** Based on the current `Decoder` / `Encoder`
variants, add benches for the following **types that are missing
coverage**:
   - **In `decoder.rs`, add benches for:**
     - `Null`
     - `TimestampNanos`
     - Arrow **`Duration`** units: `DurationSecond`, `DurationMillisecond`,
       `DurationMicrosecond`, `DurationNanosecond`
     - `Decimal32`, `Decimal64`, `Decimal256` (note: `Decimal128` is
       already exercised)
     - `RunEndEncoded`
     - `Union` (dense)
     - All schema‑resolution variants (e.g., `Int32ToFloat64`)
   - **In `avro_writer.rs`, add benches for:**
     - `Null`
     - **Date/Time/Timestamp:** `Date32`, `Time32SecsToMillis`,
       `Time32Millis`, `Time64Micros`, `TimestampMillis`, `TimestampNanos`
     - **View & large‑offset types:** `Utf8View`, `BinaryView`,
       `Utf8Large`, `LargeBinary`
     - **List variants:** `ListView`, `LargeListView`, `FixedSizeList`,
       `LargeList`
     - **Run‑end encoded:** `RunEncoded16`, `RunEncoded32`, `RunEncoded64`
     - **Intervals & Durations:** `IntervalYearMonth`, `IntervalDayTime`,
       `IntervalMonthDayNano` (if not already present), plus Arrow
       `DurationSeconds/Millis/Micros/Nanos`
     - `Union` (dense)
2. **Add a new `benches/stream.rs` for `AvroStreamWriter`.** Create a new
benchmark file that emits Avro **Single‑Object Encoding (SOE)** records for
each supported type using `AvroStreamWriter`, and then decodes them with the
`Decoder` to measure end‑to‑end streaming performance. This is a common
production path when using a schema registry, and the writer docs explicitly
call out `AvroStreamWriter`.
   - Cover all the types listed above (mirroring the static writer
     benches), including nested (`Struct`, `Map`, `List*`, `Union`) and
     fixed/UUID/decimal/interval/duration sites.
   - Include a small “mixed record” schema (several fields across
     categories) to reflect real‑world row composition.
3. **Remove `apache-avro` from the benches and rely on the in‑crate
writer.** Refactor `benches/decoder.rs` to **generate input bytes with
`arrow_avro::writer::AvroWriter` (OCF)** instead of the external `apache-avro`
crate, and drop that dev‑dependency from the crate entirely. This makes the
benches self‑contained and tests precisely the Arrow‑native encode/decode paths.
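
For context on what the proposed `benches/stream.rs` would be exercising, here is a minimal, standard-library-only sketch of the Single‑Object Encoding framing as the Avro specification describes it: a two-byte marker (`0xC3 0x01`), the 8-byte little-endian CRC‑64‑AVRO fingerprint of the schema's canonical form, then the Avro binary body. This is not `arrow-avro`'s API or implementation. The function names (`fingerprint_table`, `fingerprint64`, `frame_soe`) are hypothetical and exist only for illustration, and the example assumes `"long"` is the canonical form of a bare `long` schema.

```rust
/// CRC-64-AVRO seed value from the Avro specification's fingerprint section.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Build the 256-entry lookup table used by CRC-64-AVRO.
fn fingerprint_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // If the low bit is set, XOR in the seed after shifting right by one.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// Compute the 64-bit fingerprint over the bytes of a canonical-form schema.
fn fingerprint64(table: &[u64; 256], canonical_schema: &[u8]) -> u64 {
    let mut fp = EMPTY;
    for &b in canonical_schema {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}

/// Wrap an already-encoded Avro body in the SOE header:
/// marker (2 bytes) + fingerprint (8 bytes, little-endian) + body.
fn frame_soe(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

fn main() {
    let table = fingerprint_table();
    // Assumed canonical form of a bare `long` schema (quotes included).
    let fp = fingerprint64(&table, br#""long""#);
    // The Avro long value 1 zig-zag encodes to the single byte 0x02.
    let msg = frame_soe(fp, &[0x02]);
    println!("fingerprint = {fp:#018x}, message = {msg:02x?}");
}
```

Every SOE message carries this fixed 10-byte prefix, and the decoder must resolve the fingerprint to a schema before decoding the body. That per-message overhead is absent from the OCF path, which is one reason a separate streaming benchmark would measure something the existing benches do not.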
**Describe alternatives you've considered**
- **Status quo**: Keep partial coverage in two benches. This leaves gaps where
correctness/perf regressions can go unnoticed (e.g., nano‑precision timestamps,
Arrow `Duration(*)`, REE/Union sites) and misses the streaming path entirely.
- **Only unit tests**: Unit tests validate correctness but don’t capture
micro‑perf characteristics or regressions across serialization choices (e.g.,
offset widths, view vs. non‑view, REE). Benchmarks give actionable signals over
time.
**Additional context**
N/A