jecsand838 commented on code in PR #712:
URL: https://github.com/apache/arrow-site/pull/712#discussion_r2434723852
##########
_posts/2025-10-17-introducing-arrow-avro.md:
##########
@@ -0,0 +1,246 @@
+---
+layout: post
+title: "Announcing arrow-avro in Arrow Rust"
+description: "A new vectorized reader/writer for Avro native to Arrow, with
OCF, Single‑Object, and Confluent wire format support."
+date: "2025-10-17 00:00:00"
+author: jecsand838
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+`arrow-avro` is a Rust crate that reads and writes Avro data directly as Arrow
`RecordBatch`es. It supports Avro Object Container Files (OCF), Single‑Object
Encoding, and the Confluent Schema Registry wire format, with
projection/evolution, tunable batch sizing, and an optional `StringViewArray`
for faster strings. Its vectorized design reduces copies and cache misses,
making both batch (files) and streaming (Kafka) pipelines simpler and faster.
+
+## Motivation
+
+As a row‑oriented format, Avro is optimized for encoding one record at a time,
while Apache Arrow is columnar, optimized for vectorized analytics. When Avro
data is decoded record‑by‑record and then materialized into Arrow arrays,
systems pay for extra allocations, branches, and cache‑unfriendly memory access
(exactly the overhead Arrow's design tries to avoid). One example of this challenge can be found in [DataFusion's Avro Datasource](https://github.com/apache/datafusion/tree/main/datafusion/datasource-avro). This row-to-column impedance mismatch shows up as unnecessary work in hot paths.
+
+Today, DataFusion exposes Avro as a first‑class data source so that users can
easily read data from Avro files. Under the hood, however, DataFusion has
maintained its own Avro to Arrow translation layer, a pragmatic design
necessitated by the absence of a dedicated, upstream Arrow‑first Avro
reader/writer. The desire to improve this has led to an open discussion about moving to an upstream `arrow-avro` implementation to reduce duplication and improve performance (see ["Consider using upstream arrow‑avro reader"](https://github.com/apache/datafusion/issues/14097)).
+
+The cost of sticking with a row‑centric path isn't only about CPU cycles; it
also complicates features like schema projection and evolution. A reader that
thinks in rows must repeatedly reconstruct column buffers that Arrow prefers to
build in batches, while a column‑first reader can plan a decode once, reuse
builders efficiently, and emit `RecordBatch`es that flow directly into
vectorized operators. In an engine like DataFusion, designed to push down
projection and operate on Arrow arrays, closing this gap has clear benefits for
simplicity and throughput.
+
+Modern pipelines magnify these needs because Avro isn't just a file format; it also shows up on the wire. Kafka ecosystems commonly use [Confluent Schema Registry's wire format](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html), and many services adopt [Avro Single‑Object Encoding](https://avro.apache.org/docs/1.11.1/specification/). In both cases,
decoding into Arrow batches instead of per‑row values is what lets downstream
compute stay vectorized and efficient.
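
To make the two framings concrete, here is a stdlib-only Rust sketch of the byte layouts just described. The layouts follow the Confluent wire-format documentation and the Avro specification; the helper names are illustrative and are not part of the `arrow-avro` API.

```rust
// Illustrative sketch of the two stream framings; helper names are ours,
// not part of the arrow-avro API.

/// Confluent wire format: magic byte 0x00, a 4-byte big-endian
/// schema-registry ID, then the Avro-encoded body.
fn frame_confluent(schema_id: u32, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + body.len());
    out.push(0x00);
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(body);
    out
}

/// Avro Single-Object Encoding: marker bytes 0xC3 0x01, the 8-byte
/// little-endian CRC-64-AVRO fingerprint of the writer schema, then the body.
fn frame_single_object(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Parse a Confluent frame back into (schema_id, body).
fn parse_confluent(frame: &[u8]) -> Option<(u32, &[u8])> {
    if frame.len() < 5 || frame[0] != 0x00 {
        return None; // not a Confluent-framed message
    }
    let id = u32::from_be_bytes(frame[1..5].try_into().ok()?);
    Some((id, &frame[5..]))
}

fn main() {
    let framed = frame_confluent(42, b"avro-bytes");
    let (id, body) = parse_confluent(&framed).unwrap();
    println!("schema id {id}, body {} bytes", body.len());
}
```

In both framings the prefix identifies the writer schema, which is exactly what lets a decoder look up that schema and keep decoding whole batches instead of inspecting records one at a time.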
+
+Within the Rust Arrow ecosystem, the `arrow-avro` effort was created
specifically to address this gap by providing a vectorized Avro reader/writer
that speaks Arrow natively and supports the major Avro container/framing modes
in common use (Object Container Files, Single‑Object Encoding, and Confluent's
wire format). Centralizing this logic upstream avoids bespoke, row‑based shims
scattered across projects and sets a clearer path for evolution and
maintenance (see the Arrow‑rs issue ["Add Avro Support"](https://github.com/apache/arrow-rs/issues/4886)).
+
+Development of an Arrow‑first Avro implementation upstream promises a simpler
surface area for users and a more maintainable code path for contributors while
opening the door to better integration for both batch and streaming clients.
+
+## Introducing `arrow-avro`
+
+[`arrow-avro`](https://github.com/apache/arrow-rs/tree/main/arrow-avro) is a
high‑performance Rust crate that converts between Avro and Arrow with a
column‑first, batch‑oriented design. On the read path, it decodes Avro Object
Container Files (OCF), Single‑Object Encoding (SOE), and the Confluent Schema
Registry wire format directly into Arrow `RecordBatch`es. On the write path, it produces OCF files as well as Single‑Object and Confluent framed messages.
+
+The crate exposes two primary read APIs: a high-level `Reader` for OCF inputs
and a low-level `Decoder` for streaming frames. For SOE and Confluent frames, a
`SchemaStore` is also provided that resolves fingerprints or schema IDs to full
Avro writer schemas, enabling schema evolution while keeping the decode path
vectorized.
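
For Single‑Object Encoding, the fingerprint a `SchemaStore` resolves is the CRC‑64‑AVRO fingerprint defined in the Avro specification, computed over the schema's parsed canonical form. As a rough illustration, here is that function transcribed from the spec's pseudocode into stdlib-only Rust (`arrow-avro` has its own implementation internally; this sketch is ours):

```rust
// CRC-64-AVRO schema fingerprint, transcribed from the pseudocode in the
// Avro specification. "EMPTY" is the fingerprint of the empty byte string.
const EMPTY: u64 = 0xC15D_213A_A4D7_A795;

fn fingerprint64(buf: &[u8]) -> u64 {
    // Build the 256-entry lookup table; the spec caches this, but it is
    // recomputed here for brevity.
    let mut table = [0u64; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // (fp & 1).wrapping_neg() is all-ones when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *entry = fp;
    }
    let mut fp = EMPTY;
    for &b in buf {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xFF) as usize];
    }
    fp
}

fn main() {
    // Fingerprint of a schema string (in practice, its parsed canonical form).
    let fp = fingerprint64(br#""string""#);
    println!("fingerprint: {fp:016x}");
}
```

A store keyed by this 64-bit value (or by a Confluent schema ID) is all the decoder needs to recover the full writer schema for each incoming frame.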
+
+On the write side, `AvroWriter` produces OCF (including container‑level
compression), while `AvroStreamWriter` produces framed Avro messages for
Single‑Object or Confluent encodings, as configured via the
`WriterBuilder::with_fingerprint_strategy(...)` knob.
+
+Configuration is intentionally minimal but practical. For instance, the
`ReaderBuilder` exposes knobs covering both batch file ingestion and streaming
systems without forcing format‑specific code paths.
+
+## Architecture & Technical Overview
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start; padding: 20px 15px;">
+<img src="{{ site.baseurl
}}/img/introducing-arrow-avro/arrow-avro-architecture.svg"
+ width="100%"
+ alt="High-level `arrow-avro` architecture"
+ style="background:#fff">
+</div>
+
+At a high level,
[`arrow-avro`](https://arrow.apache.org/rust/arrow_avro/index.html) splits
cleanly into read and write paths built around Arrow `RecordBatch`es. The read
side turns Avro (OCF files or framed byte streams) into Arrow arrays in
batches, while the write side takes Arrow batches and produces OCF files or
streaming frames. When you build an `AvroStreamWriter`, the framing (SOE or
Confluent) is part of the stream output based on the configured fingerprint
strategy; no separate framing step is required. The public API and module layout
are intentionally small so most applications only touch a builder, a
reader/decoder, and (optionally) a schema store for schema evolution while
streaming.
+
+On the [read](https://arrow.apache.org/rust/arrow_avro/reader/index.html)
path, everything starts with
[`ReaderBuilder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.ReaderBuilder.html).
From a single builder you can create a
[`Reader`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Reader.html)
for Object Container Files (OCF) or a streaming
[`Decoder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Decoder.html)
for Single‑Object/Confluent frames. The `Reader` pulls OCF blocks and yields
Arrow `RecordBatch`es while the `Decoder` is push‑based, i.e. you feed bytes as
they arrive and then call `flush` to drain completed batches. This design lets
the same decode plan serve file and streaming use cases with minimal branching.
Review Comment:
 I cleaned this up. The wording I chose for explaining how the same
underlying decoder logic is shared between file and streaming use cases was
poor.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]