jecsand838 commented on code in PR #712:
URL: https://github.com/apache/arrow-site/pull/712#discussion_r2434723852
##########
_posts/2025-10-17-introducing-arrow-avro.md:
##########
@@ -0,0 +1,246 @@
+---
+layout: post
+title: "Announcing arrow-avro in Arrow Rust"
+description: "A new vectorized reader/writer for Avro native to Arrow, with
OCF, Single‑Object, and Confluent wire format support."
+date: "2025-10-17 00:00:00"
+author: jecsand838
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+`arrow-avro` is a Rust crate that reads and writes Avro data directly as Arrow
`RecordBatch`es. It supports Avro Object Container Files (OCF), Single‑Object
Encoding, and the Confluent Schema Registry wire format, with
projection/evolution, tunable batch sizing, and an optional `StringViewArray`
for faster strings. Its vectorized design reduces copies and cache misses,
making both batch (files) and streaming (Kafka) pipelines simpler and faster.
+
+## Motivation
+
+As a row‑oriented format, Avro is optimized for encoding one record at a time,
while Apache Arrow is columnar, optimized for vectorized analytics. When Avro
data is decoded record‑by‑record and then materialized into Arrow arrays,
systems pay for extra allocations, branches, and cache‑unfriendly memory access
(exactly the overhead Arrow's design tries to avoid). One example of this challenge can be found in [DataFusion's Avro Datasource](https://github.com/apache/datafusion/tree/main/datafusion/datasource-avro). This row-to-column impedance mismatch shows up as unnecessary work in hot paths.
+
+Today, DataFusion exposes Avro as a first‑class data source so that users can
easily read data from Avro files. Under the hood, however, DataFusion has
maintained its own Avro to Arrow translation layer, a pragmatic design
necessitated by the absence of a dedicated, upstream Arrow‑first Avro
reader/writer. The desire to improve this has led to an open discussion about moving to an upstream `arrow-avro` implementation to reduce duplication and improve performance (see ["Consider using upstream arrow‑avro reader"](https://github.com/apache/datafusion/issues/14097)).
+
+The cost of sticking with a row‑centric path isn't only about CPU cycles; it
also complicates features like schema projection and evolution. A reader that
thinks in rows must repeatedly reconstruct column buffers that Arrow prefers to
build in batches, while a column‑first reader can plan a decode once, reuse
builders efficiently, and emit `RecordBatch`es that flow directly into
vectorized operators. In an engine like DataFusion, designed to push down
projection and operate on Arrow arrays, closing this gap has clear benefits for
simplicity and throughput.
+
+Modern pipelines magnify these needs because Avro isn't just a file format; it also shows up on the wire. Kafka ecosystems commonly use [Confluent Schema Registry's wire format](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html), and many services adopt [Avro Single‑Object Encoding](https://avro.apache.org/docs/1.11.1/specification/). In both cases,
decoding into Arrow batches instead of per‑row values is what lets downstream
compute stay vectorized and efficient.
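
To make the two framings concrete, here is a stdlib-only Rust sketch of the byte layouts just described. The layouts follow the Confluent wire-format documentation and the Avro specification; the helper names are illustrative and are not part of the `arrow-avro` API.

```rust
// Illustrative sketch of the two stream framings; helper names are ours,
// not part of the arrow-avro API.

/// Confluent wire format: magic byte 0x00, a 4-byte big-endian
/// schema-registry ID, then the Avro-encoded body.
fn frame_confluent(schema_id: u32, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + body.len());
    out.push(0x00);
    out.extend_from_slice(&schema_id.to_be_bytes());
    out.extend_from_slice(body);
    out
}

/// Avro Single-Object Encoding: marker bytes 0xC3 0x01, the 8-byte
/// little-endian CRC-64-AVRO fingerprint of the writer schema, then the body.
fn frame_single_object(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&[0xC3, 0x01]);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Parse a Confluent frame back into (schema_id, body).
fn parse_confluent(frame: &[u8]) -> Option<(u32, &[u8])> {
    if frame.len() < 5 || frame[0] != 0x00 {
        return None; // not a Confluent-framed message
    }
    let id = u32::from_be_bytes(frame[1..5].try_into().ok()?);
    Some((id, &frame[5..]))
}

fn main() {
    let framed = frame_confluent(42, b"avro-bytes");
    let (id, body) = parse_confluent(&framed).unwrap();
    println!("schema id {id}, body {} bytes", body.len());
}
```

In both framings the prefix identifies the writer schema, which is exactly what lets a decoder look up that schema and keep decoding whole batches instead of inspecting records one at a time.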
+
+Within the Rust Arrow ecosystem, the `arrow-avro` effort was created
specifically to address this gap by providing a vectorized Avro reader/writer
that speaks Arrow natively and supports the major Avro container/framing modes
in common use (Object Container Files, Single‑Object Encoding, and Confluent's
wire format). Centralizing this logic upstream avoids bespoke, row‑based shims
scattered across projects and sets a clearer path for evolution and
maintenance (see the Arrow‑rs issue ["Add Avro Support"](https://github.com/apache/arrow-rs/issues/4886)).
+
+Development of an Arrow‑first Avro implementation upstream promises a simpler
surface area for users and a more maintainable code path for contributors while
opening the door to better integration for both batch and streaming clients.
+
+## Introducing `arrow-avro`
+
+[`arrow-avro`](https://github.com/apache/arrow-rs/tree/main/arrow-avro) is a
high‑performance Rust crate that converts between Avro and Arrow with a
column‑first, batch‑oriented design. On the read path, it decodes Avro Object
Container Files (OCF), Single‑Object Encoding (SOE), and the Confluent Schema
Registry wire format directly into Arrow `RecordBatch`es. On the write path, it produces OCF files as well as Single‑Object and Confluent framed messages.
+
+The crate exposes two primary read APIs: a high-level `Reader` for OCF inputs
and a low-level `Decoder` for streaming frames. For SOE and Confluent frames, a
`SchemaStore` is also provided that resolves fingerprints or schema IDs to full
Avro writer schemas, enabling schema evolution while keeping the decode path
vectorized.
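
For Single‑Object Encoding, the fingerprint a `SchemaStore` resolves is the CRC‑64‑AVRO fingerprint defined in the Avro specification, computed over the schema's parsed canonical form. As a rough illustration, here is that function transcribed from the spec's pseudocode into stdlib-only Rust (`arrow-avro` has its own implementation internally; this sketch is ours):

```rust
// CRC-64-AVRO schema fingerprint, transcribed from the pseudocode in the
// Avro specification. "EMPTY" is the fingerprint of the empty byte string.
const EMPTY: u64 = 0xC15D_213A_A4D7_A795;

fn fingerprint64(buf: &[u8]) -> u64 {
    // Build the 256-entry lookup table; the spec caches this, but it is
    // recomputed here for brevity.
    let mut table = [0u64; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // (fp & 1).wrapping_neg() is all-ones when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *entry = fp;
    }
    let mut fp = EMPTY;
    for &b in buf {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xFF) as usize];
    }
    fp
}

fn main() {
    // Fingerprint of a schema string (in practice, its parsed canonical form).
    let fp = fingerprint64(br#""string""#);
    println!("fingerprint: {fp:016x}");
}
```

A store keyed by this 64-bit value (or by a Confluent schema ID) is all the decoder needs to recover the full writer schema for each incoming frame.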
+
+On the write side, `AvroWriter` produces OCF (including container‑level
compression), while `AvroStreamWriter` produces framed Avro messages for
Single‑Object or Confluent encodings, as configured via the
`WriterBuilder::with_fingerprint_strategy(...)` knob.
+
+Configuration is intentionally minimal but practical. For instance, the
`ReaderBuilder` exposes knobs covering both batch file ingestion and streaming
systems without forcing format‑specific code paths.
+
+## Architecture & Technical Overview
+
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start; padding: 20px 15px;">
+<img src="{{ site.baseurl
}}/img/introducing-arrow-avro/arrow-avro-architecture.svg"
+ width="100%"
+ alt="High-level `arrow-avro` architecture"
+ style="background:#fff">
+</div>
+
+At a high level,
[`arrow-avro`](https://arrow.apache.org/rust/arrow_avro/index.html) splits
cleanly into read and write paths built around Arrow `RecordBatch`es. The read
side turns Avro (OCF files or framed byte streams) into Arrow arrays in
batches, while the write side takes Arrow batches and produces OCF files or
streaming frames. When you build an `AvroStreamWriter`, the framing (SOE or
Confluent) is part of the stream output based on the configured fingerprint
strategy; no separate framing step is required. The public API and module layout
are intentionally small so most applications only touch a builder, a
reader/decoder, and (optionally) a schema store for schema evolution while
streaming.
+
+On the [read](https://arrow.apache.org/rust/arrow_avro/reader/index.html)
path, everything starts with
[`ReaderBuilder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.ReaderBuilder.html).
From a single builder you can create a
[`Reader`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Reader.html)
for Object Container Files (OCF) or a streaming
[`Decoder`](https://arrow.apache.org/rust/arrow_avro/reader/struct.Decoder.html)
for Single‑Object/Confluent frames. The `Reader` pulls OCF blocks and yields
Arrow `RecordBatch`es while the `Decoder` is push‑based, i.e. you feed bytes as
they arrive and then call `flush` to drain completed batches. This design lets
the same decode plan serve file and streaming use cases with minimal branching.
Review Comment:
 I cleaned this up. The wording I chose for explaining how the same
underlying decoder logic is shared between file and streaming use cases was
poor.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]