Re: [PR] rfc: Implement an API for all Data File Formats [iceberg-rust]

via GitHub Thu, 28 May 2026 14:48:29 -0700


Kurtiscwright commented on code in PR #2384:
URL: https://github.com/apache/iceberg-rust/pull/2384#discussion_r3320861940



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,510 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+**Authors:** Kurtis C. Wright
+**Last updated:** 2026-05-20
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.1, Rust 1.92) is the core library of the 
Apache Iceberg Rust project. The crate depends directly on the `parquet` crate 
(with the `async` feature) and on the `arrow-*` crates. It has no feature flags 
today.
+
+For data file writing, the crate provides `FileWriter` / `FileWriterBuilder` 
traits that are format-agnostic at the type level, but `ParquetWriterBuilder` 
and `ParquetWriter` are the only implementation. Higher-level writers 
(`DataFileWriterBuilder`, `EqualityDeleteFileWriterBuilder`) are generic over 
any `FileWriterBuilder`, but every instantiation uses Parquet.
+
+For data file reading, `ArrowReaderBuilder` and `ArrowReader` are 
Parquet-specific despite the generic name. `TableScan::to_arrow` wires 
`ArrowReaderBuilder` as the only reader path. `FileScanTask` carries a 
`data_file_format` field, but the reader ignores it.
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read by the `iceberg` 
crate today, even though both are valid per the Iceberg spec.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` and threading format-specific logic through every layer.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options meaningless for other formats. The name 
conflates the in-memory representation with the on-disk format.
+
+3. **No format-agnostic statistics.** Statistics computation is tightly 
coupled to Parquet's `Statistics` type.
+
+4. **V3 types will need per-format serialization.** Variant uses shredding in 
Parquet, binary in ORC, unions in Avro. Without a format abstraction, each new 
type means new `match` arms everywhere.
+
+5. **Arrow version coupling.** The core crate depends on specific `arrow-*` 
versions. Upgrading Arrow in `datafusion` or other integrations forces lockstep 
upgrades across the dependency graph.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)). Java's design uses two 
generic parameters (data type `D` and engine schema `S`) with a registry keyed 
by `(FileFormat, Class<?>)`. PyIceberg has an open proposal 
([#3100](https://github.com/apache/iceberg-python/issues/3100)) that drops 
generics entirely, keying on file format alone.
+
+This RFC proposes a composable three-layer architecture that separates the 
in-memory processing representation from the file format layer, using Rust's 
trait system for static dispatch within a layer and dynamic dispatch at layer 
boundaries. It aligns with the kernel architecture proposed in 
[#1817](https://github.com/apache/iceberg-rust/issues/1817) and the 
modularization tracked in 
[#1819](https://github.com/apache/iceberg-rust/issues/1819).
+
+## Goals
+
+1. Define a composable three-layer architecture where the file format, 
in-memory processing representation, and engine are independent axes of 
variation. A `DataBatch` trait defines the processing contract. `FormatReader` 
and `FormatWriter` traits bridge file formats to batch types. No layer imposes 
a conversion on another.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration.
+
+3. Establish the crate architecture that allows the core `iceberg` kernel to 
be representation-agnostic, decoupling Arrow version pinning from the core 
library.
+
+4. Provide interoperability with Java and Python Iceberg implementations at 
the conceptual level (same registry key semantics, same TCK coverage) while 
using Rust's trait system for zero-cost abstraction within a layer.
+
+## Non-Goals
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet-with-Arrow implementation. ORC, Avro, Vortex, and Lance are follow-up 
work.
+
+2. **Runtime library loading.** Rust has no stable ABI. No format under 
discussion requires this.
+
+3. **Puffin support.** Puffin files have a different lifecycle and are handled 
separately.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds beneath it, not a replacement.
+
+5. **Implement variant shredding.** The hooks are provided. Implementation 
depends on [#2188](https://github.com/apache/iceberg-rust/pull/2188).
+
+6. **Complete crate separation.** This RFC establishes trait boundaries. 
Extraction is follow-up work per 
[#1817](https://github.com/apache/iceberg-rust/issues/1817) and 
[#1819](https://github.com/apache/iceberg-rust/issues/1819).
+
+7. **Change the Iceberg table spec.** Rust-only API change.
+
+8. **Modify manifest paths.** Manifests remain in Avro via existing code.
+
+## Design
+
+### Architecture
+
+Three independent axes determine how data flows through the system:
+
+| Axis | Controlled by | Determined when | Can change mid-session? |
+|------|---------------|-----------------|-------------------------|
+| **In-memory representation** | The engine embedding Iceberg (DataFusion, 
Spark, Comet) or the direct library user | Session start | No |
+| **Processing operations** | The Iceberg kernel | N/A (always available) | 
N/A |
+| **File format** | The table creator, stored in table metadata | Table 
creation | No (one format per table) |
+
+An Iceberg session has one in-memory representation. A table has one data file 
format. A session may scan multiple tables with different formats, and a 
proposed multi-table transaction could span Parquet and ORC tables. In all 
cases, batches of data flow through three layers with no intermediate 
conversions:
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                       Engine Layer                                │
+│  The engine's native data representation.                        │
+│  Examples: Arrow RecordBatch, Vortex compressed arrays           │
+│  The engine provides a type that implements DataBatch.            │
+│  This choice is fixed for the session.                           │
+└─────────────────────────────────────┬───────────────────────────┘
+                                      │ (same concrete type throughout)
+┌─────────────────────────────────────┼───────────────────────────┐
+│                    Processing Layer                               │
+│  Iceberg operations on in-memory data: expression evaluation,    │
+│  partition transforms, schema evolution, metrics collection,     │
+│  constant injection, delete application.                         │
+│  Works with the concrete batch type chosen by the engine.        │
+└─────────────────────────────────────┬───────────────────────────┘
+                                      │ (same concrete type throughout)
+┌─────────────────────────────────────┼───────────────────────────┐
+│                    File Format Layer                              │
+│  FormatReader/FormatWriter implementations read/write physical   │
+│  files, producing/consuming the engine's batch type directly.    │
+│  One implementation per (format, batch type) pair.               │
+└─────────────────────────────────────┬───────────────────────────┘
+                                      │
+                           Physical Files
+                      (Parquet, ORC, Avro, ...)
+```
+
+### Separation of responsibilities
+
+Each implementor has a clearly defined contract. The kernel provides 
processing operations that format and engine developers do not need to 
reimplement.
+
+**Format implementor** (for example, adding ORC):
+
+| MUST implement | Handled by the kernel (unless the format opts in) |
+|---|---|
+| Decode requested columns from disk (projection is an I/O decision) | 
Residual row-level filtering |
+| Encode batches to disk | Schema evolution (add columns, widen types, 
reorder) |
+| Collect file-level metrics during write | Constant injection (metadata 
columns like `_file`, `_pos`) |
+| Handle format-specific optimizations (variant shredding, statistics-based 
chunk skipping) | Delete application (merge-on-read) |
+| Report what operations the reader already handled | Partition routing, file 
rolling, location generation |
+
+A format reader MAY handle operations from the right column during I/O if it 
can do so more efficiently (for example, Parquet reading an `int32` column 
directly into `int64` for schema evolution). When it does, it reports what it 
handled via `ReadResult` so the kernel skips the redundant pass. The kernel 
handles anything the format does not.
+
+**DataBatch implementor** (for example, adding GPU-native buffers):
+
+| MUST implement | Gets for free |
+|---|---|
+| `filter` — evaluate a predicate against the data | All kernel orchestration |
+| `project` — subset columns by field ID | All format readers that produce the 
type |
+| `evolve_schema` — widen types, add columns with defaults | TCK to validate 
correctness |
+| `inject_constants` — add metadata column values | |
+| `column_metrics` — compute min/max/null/NaN stats | |
+
+**Engine implementor** (for example, adding Comet):
+
+| What you want | What you implement |
+|---|---|
+| Arrow works for your engine | Nothing. Use the shipped `RecordBatch` default 
and existing format readers. |
+| Custom in-memory type | `impl DataBatch for YourType`. Pass TCK Layer 1. 
Register format readers that produce your type. |
+| Full control over execution | Bypass the kernel's default scan loop. Use 
format readers as a stream source and drive your own processing. |
+
+### Core traits
+
+**`DataBatch`** defines the contract for an in-memory data representation that 
the Iceberg kernel can process. The engine chooses a concrete batch type at 
session start and that choice does not change. `DataBatch` methods (`filter`, 
`project`, `evolve_schema`, `inject_constants`) return `Self` — they operate on 
the concrete type. Implementations have full freedom to fuse operations 
internally for performance. The kernel does not decompose operations into steps 
or dictate intermediate materializations.
+
+The shipped default is `impl DataBatch for RecordBatch`. Supporting a new 
representation requires implementing `DataBatch` and passing the TCK Layer 1 
suite.
+
+**`FormatReader`** reads a file and produces a stream of batches. Its contract:
+
+- The reader MUST decode only the columns in `ReadOptions::schema` (projection 
is an I/O concern).
+- The reader MAY skip chunks that cannot match `ReadOptions::predicate` using 
format-level statistics.
+- The reader MAY support byte-range splits via `ReadOptions::split_start` / 
`split_length`.
+- The reader MAY handle schema evolution or other kernel operations during 
decode if the format can do so more efficiently (e.g., Parquet type promotion 
during decompression).
+- The reader MUST NOT handle partitioning, transaction management, or name 
mapping (field ID resolution for files not written by Iceberg). Those are 
always kernel responsibilities.
+
+The reader returns a `ReadResult` containing the batch stream and reporting 
what operations the reader handled. The kernel applies any remaining operations 
(residual filtering, schema evolution, constant injection, delete application) 
that the format did not.
+
+**`FormatWriter`** writes batches to a file. Its contract:
+
+- The writer MUST encode batches into the file format.
+- The writer MUST collect file-level metrics per 
`WriteOptions::metrics_config`.
+- The writer MUST handle format-specific encoding concerns (like variant 
shredding layout for Parquet).
+- The writer MUST NOT handle partitioning, sorting, or transaction management. 
The caller sends pre-partitioned, pre-sorted batches. The kernel handles commit.
+
+Both traits are dyn-compatible: no associated types, no generic parameters. 
The registry stores them as `Arc<dyn FormatReader>` and `Arc<dyn 
FormatWriter>`. This is the same pattern as `Storage` and `StorageFactory` in 
the IO layer.
+
+**`ReadOptions`** configures a read: Iceberg projection schema, filter 
predicate for pushdown, byte-range splits, case sensitivity, batch size, 
format-specific properties, and constant values for metadata columns.
+
+**`WriteOptions`** configures a write: Iceberg schema, format-specific 
properties, file-level metadata, content type, metrics collection 
configuration, overwrite flag, and encryption parameters.
+
+Both structs are `#[non_exhaustive]`. Fields are added without a breaking 
change.
+
+**`ReadResult`** contains the batch stream and communicates what the reader 
handled. Additional fields for reporting other handled operations (schema 
evolution, constant injection) will be added as formats opt in to handling 
them. The struct is `#[non_exhaustive]` so fields are added without a breaking 
change.
+
+```rust
+#[non_exhaustive]
+pub struct ReadResult {
+    /// The stream of batches with requested columns decoded.
+    pub stream: DataStream,
+    /// The portion of the predicate the reader could NOT evaluate.
+    /// None means the reader fully handled the predicate.
+    /// Some(predicate) means the kernel must apply residual filtering.
+    pub residual_predicate: Option<BoundPredicate>,
+    /// Whether the reader applied schema evolution during decode.
+    /// If true, the kernel skips the post-read schema evolution pass.
+    pub schema_evolved: bool,
+}
+```
+
+**`FormatFileWriter`** accepts batches via `write_batch(&dyn DataBatch)` and 
returns a `WriterResult` on `close()`. `WriterResult` contains 
`Vec<DataFileBuilder>` — builders rather than completed `DataFile` values 
because the format layer fills format-specific fields (file size, column sizes, 
metrics from its own metadata) while the kernel fills Iceberg-level fields 
(partition values, sequence number, snapshot ID).
+
+**`FormatRegistry`** maps `(DataFileFormat, TypeId)` pairs to reader and 
writer implementations. The first dimension is the file format (determined at 
runtime from table metadata). The second dimension is the batch type (fixed at 
session start by the engine). This matches Java's `FormatModelRegistry` key of 
`(FileFormat, Class<?>)`.
+
+The registry exposes `reader::<B>(format)` and `writer::<B>(format)`, where 
`B` is the batch type the engine chose. Errors are eager: if `(format, B)` is 
not registered, the call fails immediately at scan planning time rather than 
mid-stream during I/O. This is a runtime contract — the `TypeId` tells the 
registry what batch type the caller expects, but the type system does not 
enforce that the reader actually produces that type. A reader that returns the 
wrong type fails at the first downcast with a clear error. Java makes the same 
tradeoff with `Class<?>`.
+
+A process-wide default registry is available via `OnceLock`. Custom registries 
can be provided to `TableScanBuilder` for tests or restricted configurations.

Review Comment:
   Are you referring to dependency injection or do you mean raise the registry 
up to the Catalog for initialization?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] rfc: Implement an API for all Data File Formats [iceberg-rust]

Reply via email to