Re: [PR] rfc: Implement an API for all Data File Formats [iceberg-rust]

via GitHub Thu, 28 May 2026 14:54:45 -0700


Kurtiscwright commented on code in PR #2384:
URL: https://github.com/apache/iceberg-rust/pull/2384#discussion_r3320887544



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,510 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+**Authors:** Kurtis C. Wright
+**Last updated:** 2026-05-20
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.1, Rust 1.92) is the core library of the 
Apache Iceberg Rust project. The crate depends directly on the `parquet` crate 
(with the `async` feature) and on the `arrow-*` crates. It has no feature flags 
today.
+
+For data file writing, the crate provides `FileWriter` / `FileWriterBuilder` 
traits that are format-agnostic at the type level, but `ParquetWriterBuilder` 
and `ParquetWriter` are the only implementation. Higher-level writers 
(`DataFileWriterBuilder`, `EqualityDeleteFileWriterBuilder`) are generic over 
any `FileWriterBuilder`, but every instantiation uses Parquet.
+
+For data file reading, `ArrowReaderBuilder` and `ArrowReader` are 
Parquet-specific despite the generic name. `TableScan::to_arrow` wires 
`ArrowReaderBuilder` as the only reader path. `FileScanTask` carries a 
`data_file_format` field, but the reader ignores it.
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read by the `iceberg` 
crate today, even though both are valid per the Iceberg spec.
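
As a rough illustration of the gap described above, the sketch below shows a reader path that consults the task's format instead of assuming Parquet. Every type here is a local stand-in defined for this example (including `DataFileFormat`, `FileScanTask`, and the hypothetical `select_reader` and `UnsupportedFormat`); none of it is the crate's actual API.

```rust
use std::fmt;

/// Local stand-in for a data file format enum (assumption, not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataFileFormat {
    Parquet,
    Avro,
    Orc,
}

/// Local stand-in holding only the fields this sketch needs.
pub struct FileScanTask {
    pub data_file_path: String,
    pub data_file_format: DataFileFormat,
}

/// Hypothetical error for formats that have no reader.
#[derive(Debug, PartialEq, Eq)]
pub struct UnsupportedFormat(pub DataFileFormat);

impl fmt::Display for UnsupportedFormat {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "no reader available for {:?} data files", self.0)
    }
}

impl std::error::Error for UnsupportedFormat {}

/// Chooses a reader path from the task's format rather than ignoring it.
/// Returns a label for the chosen path; a real reader would be returned instead.
pub fn select_reader(task: &FileScanTask) -> Result<&'static str, UnsupportedFormat> {
    match task.data_file_format {
        DataFileFormat::Parquet => Ok("parquet"),
        // Avro and ORC data files are valid per the spec but have no reader here.
        other => Err(UnsupportedFormat(other)),
    }
}
```

With a check like this, an ORC or Avro data file would produce an explicit error at the dispatch point rather than being handed to a Parquet-only reader.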
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` and threading format-specific logic through every layer.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options that are meaningless for other formats. The 
name conflates the in-memory representation with the on-disk format.
+
+3. **No format-agnostic statistics.** Statistics computation is tightly 
coupled to Parquet's `Statistics` type.
+
+4. **V3 types will need per-format serialization.** Variant uses shredding in 
Parquet, binary in ORC, unions in Avro. Without a format abstraction, each new 
type means new `match` arms everywhere.
+
+5. **Arrow version coupling.** The core crate depends on specific `arrow-*` 
versions. Upgrading Arrow in `datafusion` or other integrations forces lockstep 
upgrades across the dependency graph.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)). Java's design uses two 
generic parameters (data type `D` and engine schema `S`) with a registry keyed 
by `(FileFormat, Class<?>)`. PyIceberg has an open proposal 
([#3100](https://github.com/apache/iceberg-python/issues/3100)) that drops 
generics entirely, keying on file format alone.
+
+This RFC proposes a composable three-layer architecture that separates the 
in-memory processing representation from the file format layer, using Rust's 
trait system for static dispatch within a layer and dynamic dispatch at layer 
boundaries. It aligns with the kernel architecture proposed in 
[#1817](https://github.com/apache/iceberg-rust/issues/1817) and the 
modularization tracked in 
[#1819](https://github.com/apache/iceberg-rust/issues/1819).
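
For readers comparing against Java's `(FileFormat, Class<?>)` key, one possible Rust analogue is a map keyed by a format enum plus `std::any::TypeId`. This is a minimal sketch under assumptions: `FileFormat`, `FormatRegistry`, `ArrowBatch`, and `VortexBatch` are hypothetical names invented for this example, and the registry this RFC actually proposes is not shown in this excerpt.

```rust
use std::any::TypeId;
use std::collections::HashMap;

/// Local stand-in for a file format enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FileFormat {
    Parquet,
    Avro,
    Orc,
}

/// Hypothetical marker types standing in for in-memory batch representations.
pub struct ArrowBatch;
pub struct VortexBatch;

/// Maps (file format, batch type) to a label for the registered model.
/// A fuller design would store a reader/writer factory instead of a label.
#[derive(Default)]
pub struct FormatRegistry {
    models: HashMap<(FileFormat, TypeId), &'static str>,
}

impl FormatRegistry {
    /// Registers a model for `format` and batch type `B`.
    pub fn register<B: 'static>(&mut self, format: FileFormat, name: &'static str) {
        self.models.insert((format, TypeId::of::<B>()), name);
    }

    /// Looks up the model for `format` and batch type `B`, if one is registered.
    pub fn lookup<B: 'static>(&self, format: FileFormat) -> Option<&'static str> {
        self.models.get(&(format, TypeId::of::<B>())).copied()
    }
}
```

`TypeId` plays the role of Java's `Class<?>` here: the same format can be registered once per batch type, and a lookup for an unregistered pair returns `None`.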
+
+## Goals
+
+1. Define a composable three-layer architecture where the file format, 
in-memory processing representation, and engine are independent axes of 
variation. A `DataBatch` trait defines the processing contract. `FormatReader` 
and `FormatWriter` traits bridge file formats to batch types. No layer imposes 
a conversion on another.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration.
+
+3. Establish the crate architecture that allows the core `iceberg` kernel to 
be representation-agnostic, decoupling Arrow version pinning from the core 
library.
+
+4. Provide interoperability with Java and Python Iceberg implementations at 
the conceptual level (same registry key semantics, same TCK coverage) while 
using Rust's trait system for zero-cost abstraction within a layer.
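
To make the first goal concrete, here is a minimal sketch of how a batch contract and a format bridge could fit together, with static dispatch inside the processing layer and dynamic dispatch at the boundary. The trait names `DataBatch` and `FormatReader` come from the text above, but their method signatures, along with `VecBatch`, `InMemoryReader`, `count_rows`, and `scan_row_count`, are assumptions made for illustration and may not match the RFC's definitions.

```rust
/// Assumed shape of the processing contract; the RFC's real trait may differ.
pub trait DataBatch {
    fn num_rows(&self) -> usize;
}

/// Assumed shape of the bridge from a file format to one batch type `B`.
pub trait FormatReader<B: DataBatch> {
    fn read_all(&mut self) -> Vec<B>;
}

/// Toy batch type standing in for an engine representation such as Arrow.
pub struct VecBatch(pub Vec<i64>);

impl DataBatch for VecBatch {
    fn num_rows(&self) -> usize {
        self.0.len()
    }
}

/// Toy reader that hands back pre-built batches instead of decoding a file.
pub struct InMemoryReader {
    pending: Vec<VecBatch>,
}

impl InMemoryReader {
    pub fn new(batches: Vec<VecBatch>) -> Self {
        Self { pending: batches }
    }
}

impl FormatReader<VecBatch> for InMemoryReader {
    fn read_all(&mut self) -> Vec<VecBatch> {
        // Drains the pending batches, leaving the reader empty.
        std::mem::take(&mut self.pending)
    }
}

/// Static dispatch within the processing layer: monomorphized per batch type.
pub fn count_rows<B: DataBatch>(batches: &[B]) -> usize {
    batches.iter().map(|b| b.num_rows()).sum()
}

/// Dynamic dispatch at the layer boundary: the reader is a trait object,
/// while the batch type stays the same concrete `B` throughout.
pub fn scan_row_count<B: DataBatch>(reader: &mut dyn FormatReader<B>) -> usize {
    let batches = reader.read_all();
    count_rows(&batches)
}
```

In this arrangement the orchestration code never names a file format, and the batch type is never converted on its way from the reader to the processing function.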
+
+## Non-Goals
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet-with-Arrow implementation. ORC, Avro, Vortex, and Lance are follow-up 
work.
+
+2. **Runtime library loading.** Rust has no stable ABI. No format under 
discussion requires this.
+
+3. **Puffin support.** Puffin files have a different lifecycle and are handled 
separately.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a layer beneath it rather than 
replacing it.
+
+5. **Implement variant shredding.** The hooks are provided. Implementation 
depends on [#2188](https://github.com/apache/iceberg-rust/pull/2188).
+
+6. **Complete crate separation.** This RFC establishes trait boundaries. 
Extraction is follow-up work per 
[#1817](https://github.com/apache/iceberg-rust/issues/1817) and 
[#1819](https://github.com/apache/iceberg-rust/issues/1819).
+
+7. **Change the Iceberg table spec.** This RFC is a Rust-only API change.
+
+8. **Modify manifest paths.** Manifests remain in Avro via existing code.
+
+## Design
+
+### Architecture
+
+Three independent axes determine how data flows through the system:
+
+| Axis | Controlled by | Determined when | Can change mid-session? |
+|------|---------------|-----------------|-------------------------|
+| **In-memory representation** | The engine embedding Iceberg (DataFusion, 
Spark, Comet) or the direct library user | Session start | No |
+| **Processing operations** | The Iceberg kernel | N/A (always available) | 
N/A |
+| **File format** | The table creator, stored in table metadata | Table 
creation | No (one format per table) |
+
+An Iceberg session has one in-memory representation. A table has one data file 
format. A session may scan multiple tables with different formats, and a 
proposed multi-table transaction could span Parquet and ORC tables. In all 
cases, batches of data flow through three layers with no intermediate 
conversions:
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                       Engine Layer                                │
+│  The engine's native data representation.                        │
+│  Examples: Arrow RecordBatch, Vortex compressed arrays           │
+│  The engine provides a type that implements DataBatch.            │
+│  This choice is fixed for the session.                           │
+└─────────────────────────────────────┬───────────────────────────┘
+                                      │ (same concrete type throughout)
+┌─────────────────────────────────────┼───────────────────────────┐
+│                    Processing Layer                               │

Review Comment:
   Yes, I can fix that in a follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

