Kurtiscwright commented on code in PR #2384: URL: https://github.com/apache/iceberg-rust/pull/2384#discussion_r3320861940
########## docs/rfcs/0003_file_format_api.md: ########## @@ -0,0 +1,510 @@ +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. +--> + +# RFC: File Format API for Apache Iceberg Rust + +**Authors:** Kurtis C. Wright +**Last updated:** 2026-05-20 + +## Background + +### Current state + +The `iceberg` crate (version 0.9.1, Rust 1.92) is the core library of the Apache Iceberg Rust project. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today. + +For data file writing, the crate provides `FileWriter` / `FileWriterBuilder` traits that are format-agnostic at the type level, but `ParquetWriterBuilder` and `ParquetWriter` are the only implementation. Higher-level writers (`DataFileWriterBuilder`, `EqualityDeleteFileWriterBuilder`) are generic over any `FileWriterBuilder`, but every instantiation uses Parquet. + +For data file reading, `ArrowReaderBuilder` and `ArrowReader` are Parquet-specific despite the generic name. `TableScan::to_arrow` wires `ArrowReaderBuilder` as the only reader path. `FileScanTask` carries a `data_file_format` field, but the reader ignores it. + +| Format | Data file read | Data file write | Manifests | +|---------|----------------|-----------------|-----------| +| Parquet | Yes | Yes | No | +| Avro | No | No | Yes | +| ORC | No | No | No | + +A table containing ORC or Avro data files cannot be read by the `iceberg` crate today, even though both are valid per the Iceberg spec. + +### Pain points + +1. **No extension point for new formats.** Adding ORC means editing `ArrowReader` and threading format-specific logic through every layer. + +2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options meaningless for other formats. The name conflates the in-memory representation with the on-disk format. + +3. **No format-agnostic statistics.** Statistics computation is tightly coupled to Parquet's `Statistics` type. + +4. **V3 types will need per-format serialization.** Variant uses shredding in Parquet, binary in ORC, unions in Avro. Without a format abstraction, each new type means new `match` arms everywhere. + +5. **Arrow version coupling.** The core crate depends on specific `arrow-*` versions. Upgrading Arrow in `datafusion` or other integrations forces lockstep upgrades across the dependency graph. + +### Prior work + +The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)). Java's design uses two generic parameters (data type `D` and engine schema `S`) with a registry keyed by `(FileFormat, Class<?>)`. PyIceberg has an open proposal ([#3100](https://github.com/apache/iceberg-python/issues/3100)) that drops generics entirely, keying on file format alone. + +This RFC proposes a composable three-layer architecture that separates the in-memory processing representation from the file format layer, using Rust's trait system for static dispatch within a layer and dynamic dispatch at layer boundaries. It aligns with the kernel architecture proposed in [#1817](https://github.com/apache/iceberg-rust/issues/1817) and the modularization tracked in [#1819](https://github.com/apache/iceberg-rust/issues/1819). + +## Goals + +1. Define a composable three-layer architecture where the file format, in-memory processing representation, and engine are independent axes of variation. A `DataBatch` trait defines the processing contract. `FormatReader` and `FormatWriter` traits bridge file formats to batch types. No layer imposes a conversion on another. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. + +3. Establish the crate architecture that allows the core `iceberg` kernel to be representation-agnostic, decoupling Arrow version pinning from the core library. + +4. Provide interoperability with Java and Python Iceberg implementations at the conceptual level (same registry key semantics, same TCK coverage) while using Rust's trait system for zero-cost abstraction within a layer. + +## Non-Goals + +1. **Ship new format implementations.** This RFC lands the abstraction and a Parquet-with-Arrow implementation. ORC, Avro, Vortex, and Lance are follow-up work. + +2. **Runtime library loading.** Rust has no stable ABI. No format under discussion requires this. + +3. **Puffin support.** Puffin files have a different lifecycle and are handled separately. + +4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds beneath it, not a replacement. + +5. **Implement variant shredding.** The hooks are provided. Implementation depends on [#2188](https://github.com/apache/iceberg-rust/pull/2188). + +6. **Complete crate separation.** This RFC establishes trait boundaries. Extraction is follow-up work per [#1817](https://github.com/apache/iceberg-rust/issues/1817) and [#1819](https://github.com/apache/iceberg-rust/issues/1819). + +7. **Change the Iceberg table spec.** Rust-only API change. + +8. **Modify manifest paths.** Manifests remain in Avro via existing code. + +## Design + +### Architecture + +Three independent axes determine how data flows through the system: + +| Axis | Controlled by | Determined when | Can change mid-session? | +|------|---------------|-----------------|-------------------------| +| **In-memory representation** | The engine embedding Iceberg (DataFusion, Spark, Comet) or the direct library user | Session start | No | +| **Processing operations** | The Iceberg kernel | N/A (always available) | N/A | +| **File format** | The table creator, stored in table metadata | Table creation | No (one format per table) | + +An Iceberg session has one in-memory representation. A table has one data file format. A session may scan multiple tables with different formats, and a proposed multi-table transaction could span Parquet and ORC tables. In all cases, batches of data flow through three layers with no intermediate conversions: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Engine Layer │ +│ The engine's native data representation. │ +│ Examples: Arrow RecordBatch, Vortex compressed arrays │ +│ The engine provides a type that implements DataBatch. │ +│ This choice is fixed for the session. │ +└─────────────────────────────────────┬───────────────────────────┘ + │ (same concrete type throughout) +┌─────────────────────────────────────┼───────────────────────────┐ +│ Processing Layer │ +│ Iceberg operations on in-memory data: expression evaluation, │ +│ partition transforms, schema evolution, metrics collection, │ +│ constant injection, delete application. │ +│ Works with the concrete batch type chosen by the engine. │ +└─────────────────────────────────────┬───────────────────────────┘ + │ (same concrete type throughout) +┌─────────────────────────────────────┼───────────────────────────┐ +│ File Format Layer │ +│ FormatReader/FormatWriter implementations read/write physical │ +│ files, producing/consuming the engine's batch type directly. │ +│ One implementation per (format, batch type) pair. │ +└─────────────────────────────────────┬───────────────────────────┘ + │ + Physical Files + (Parquet, ORC, Avro, ...) +``` + +### Separation of responsibilities + +Each implementor has a clearly defined contract. The kernel provides processing operations that format and engine developers do not need to reimplement. + +**Format implementor** (for example, adding ORC): + +| MUST implement | Handled by the kernel (unless the format opts in) | +|---|---| +| Decode requested columns from disk (projection is an I/O decision) | Residual row-level filtering | +| Encode batches to disk | Schema evolution (add columns, widen types, reorder) | +| Collect file-level metrics during write | Constant injection (metadata columns like `_file`, `_pos`) | +| Handle format-specific optimizations (variant shredding, statistics-based chunk skipping) | Delete application (merge-on-read) | +| Report what operations the reader already handled | Partition routing, file rolling, location generation | + +A format reader MAY handle operations from the right column during I/O if it can do so more efficiently (for example, Parquet reading an `int32` column directly into `int64` for schema evolution). When it does, it reports what it handled via `ReadResult` so the kernel skips the redundant pass. The kernel handles anything the format does not. + +**DataBatch implementor** (for example, adding GPU-native buffers): + +| MUST implement | Gets for free | +|---|---| +| `filter` — evaluate a predicate against the data | All kernel orchestration | +| `project` — subset columns by field ID | All format readers that produce the type | +| `evolve_schema` — widen types, add columns with defaults | TCK to validate correctness | +| `inject_constants` — add metadata column values | | +| `column_metrics` — compute min/max/null/NaN stats | | + +**Engine implementor** (for example, adding Comet): + +| What you want | What you implement | +|---|---| +| Arrow works for your engine | Nothing. Use the shipped `RecordBatch` default and existing format readers. | +| Custom in-memory type | `impl DataBatch for YourType`. Pass TCK Layer 1. Register format readers that produce your type. | +| Full control over execution | Bypass the kernel's default scan loop. Use format readers as a stream source and drive your own processing. | + +### Core traits + +**`DataBatch`** defines the contract for an in-memory data representation that the Iceberg kernel can process. The engine chooses a concrete batch type at session start and that choice does not change. `DataBatch` methods (`filter`, `project`, `evolve_schema`, `inject_constants`) return `Self` — they operate on the concrete type. Implementations have full freedom to fuse operations internally for performance. The kernel does not decompose operations into steps or dictate intermediate materializations. + +The shipped default is `impl DataBatch for RecordBatch`. Supporting a new representation requires implementing `DataBatch` and passing the TCK Layer 1 suite. + +**`FormatReader`** reads a file and produces a stream of batches. Its contract: + +- The reader MUST decode only the columns in `ReadOptions::schema` (projection is an I/O concern). +- The reader MAY skip chunks that cannot match `ReadOptions::predicate` using format-level statistics. +- The reader MAY support byte-range splits via `ReadOptions::split_start` / `split_length`. +- The reader MAY handle schema evolution or other kernel operations during decode if the format can do so more efficiently (e.g., Parquet type promotion during decompression). +- The reader MUST NOT handle partitioning, transaction management, or name mapping (field ID resolution for files not written by Iceberg). Those are always kernel responsibilities. + +The reader returns a `ReadResult` containing the batch stream and reporting what operations the reader handled. The kernel applies any remaining operations (residual filtering, schema evolution, constant injection, delete application) that the format did not. + +**`FormatWriter`** writes batches to a file. Its contract: + +- The writer MUST encode batches into the file format. +- The writer MUST collect file-level metrics per `WriteOptions::metrics_config`. +- The writer MUST handle format-specific encoding concerns (like variant shredding layout for Parquet). +- The writer MUST NOT handle partitioning, sorting, or transaction management. The caller sends pre-partitioned, pre-sorted batches. The kernel handles commit. + +Both traits are dyn-compatible: no associated types, no generic parameters. The registry stores them as `Arc<dyn FormatReader>` and `Arc<dyn FormatWriter>`. This is the same pattern as `Storage` and `StorageFactory` in the IO layer. + +**`ReadOptions`** configures a read: Iceberg projection schema, filter predicate for pushdown, byte-range splits, case sensitivity, batch size, format-specific properties, and constant values for metadata columns. + +**`WriteOptions`** configures a write: Iceberg schema, format-specific properties, file-level metadata, content type, metrics collection configuration, overwrite flag, and encryption parameters. + +Both structs are `#[non_exhaustive]`. Fields are added without a breaking change. + +**`ReadResult`** contains the batch stream and communicates what the reader handled. Additional fields for reporting other handled operations (schema evolution, constant injection) will be added as formats opt in to handling them. The struct is `#[non_exhaustive]` so fields are added without a breaking change. + +```rust +#[non_exhaustive] +pub struct ReadResult { + /// The stream of batches with requested columns decoded. + pub stream: DataStream, + /// The portion of the predicate the reader could NOT evaluate. + /// None means the reader fully handled the predicate. + /// Some(predicate) means the kernel must apply residual filtering. + pub residual_predicate: Option<BoundPredicate>, + /// Whether the reader applied schema evolution during decode. + /// If true, the kernel skips the post-read schema evolution pass. + pub schema_evolved: bool, +} +``` + +**`FormatFileWriter`** accepts batches via `write_batch(&dyn DataBatch)` and returns a `WriterResult` on `close()`. `WriterResult` contains `Vec<DataFileBuilder>` — builders rather than completed `DataFile` values because the format layer fills format-specific fields (file size, column sizes, metrics from its own metadata) while the kernel fills Iceberg-level fields (partition values, sequence number, snapshot ID). + +**`FormatRegistry`** maps `(DataFileFormat, TypeId)` pairs to reader and writer implementations. The first dimension is the file format (determined at runtime from table metadata). The second dimension is the batch type (fixed at session start by the engine). This matches Java's `FormatModelRegistry` key of `(FileFormat, Class<?>)`. + +The registry exposes `reader::<B>(format)` and `writer::<B>(format)`, where `B` is the batch type the engine chose. Errors are eager: if `(format, B)` is not registered, the call fails immediately at scan planning time rather than mid-stream during I/O. This is a runtime contract — the `TypeId` tells the registry what batch type the caller expects, but the type system does not enforce that the reader actually produces that type. A reader that returns the wrong type fails at the first downcast with a clear error. Java makes the same tradeoff with `Class<?>`. + +A process-wide default registry is available via `OnceLock`. Custom registries can be provided to `TableScanBuilder` for tests or restricted configurations. Review Comment: Are you referring to dependency injection or do you mean raise the registry up to the Catalog for initialization? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
