pvary commented on code in PR #15380:
URL: https://github.com/apache/iceberg/pull/15380#discussion_r2840630663
##########
site/docs/blog/posts/2026-02-20-file-format-api.md:
##########
@@ -0,0 +1,136 @@
+---
+date: 2026-02-20
+title: Finalizing the Apache Iceberg File Format API
+slug: apache-iceberg-file-format-api-finalization
+authors:
+  - iceberg-pmc
+categories:
+  - announcement
+---
+
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+The Apache Iceberg community is excited to announce the **finalization of the File Format API**, a major architectural milestone that makes file formats pluggable, consistent, and engine‑agnostic across the Iceberg Java codebase.
+
+<!-- more -->
+
+For years, Iceberg has delivered high‑quality support for **Parquet**, **Avro**, and **ORC**, but the data landscape has evolved dramatically. New formats now emphasize extremely fast random access, GPU‑native encodings, flexible file layouts, and built‑in indexing structures. Opening up the possibility of integrating such formats required a new foundation.
+
+The File Format API introduces a unified, extensible layer that engines can rely on when reading and writing Iceberg data files in any supported format.
+
+## Why a New File Format API Was Needed
+
+Iceberg’s original format‑handling code grew organically as support for Parquet, Avro, and ORC matured. Over time, this approach revealed several limitations.
+
+### Fragmented and duplicated logic
+Each engine (Spark, Flink, and the generic Java implementation) maintained its own format‑specific readers, writers, and feature handling. Trying out a new format required deep modifications across multiple layers.
+
+### Large branching code paths
+Support for multiple formats was implemented through large switch statements or branching logic, making it difficult to extend and easy to introduce inconsistencies.
+
+### Uneven feature support
+Basic capabilities such as projection, filtering, and delete file handling needed custom work for each format/engine combination, slowing feature development, leaving features unavailable for some formats, and increasing maintenance cost.
+
+### Accelerating innovation in the ecosystem
+New formats have emerged with capabilities such as:
+
+- Adaptive encodings for strings, numerics, or complex types
+- Integrated indexes for fast point/range lookups
+- CPU‑ and GPU‑optimized layouts
+- File structures that do not match traditional row‑group‑based designs
+
+Enabling possible support for these formats cleanly required a more flexible architectural contract.
+
+## What the File Format API Provides
+
+The File Format API introduces a well‑defined, pluggable interface for integrating new formats into Iceberg. It allows engines to interact with formats through a standardized set of builders and metadata structures.

Review Comment:
Yeah. Kept as it is. It is working for delete files, but I don't want to advertise that, as we try to move everyone to use DVs.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
