Hi everyone,

I’d like to open a discussion on a new proposal to better support
unstructured data in Iceberg.

As tables increasingly need to reference unstructured objects (images,
video, ML artifacts, PDFs) that are too large to embed, the current
fallback is to use bare string URI columns. This has a few structural
problems: it bypasses catalog governance (requiring engines to hold broad
bucket-level credentials), lacks cross-engine portability, and breaks read
determinism if the underlying object is overwritten.

There is already an active proposal in the Parquet community to introduce
a native File logical type for physical files. To address the problems
above at the table-format layer, I've drafted a proposal for a FileRef
type (struct<path, etag>) that is designed to layer directly on top of
that work. While Parquet defines the physical columnar representation,
Iceberg's FileRef handles the table-format concerns: governance, read
determinism, snapshot isolation, and access brokering. A physical File
column in Parquet will map 1:1 to Iceberg's logical FileRef, ensuring a
unified standard from the storage layer up to the catalog.
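
To make the read-determinism point concrete, here is a minimal Python
sketch. It is only an illustration and is not taken from the proposal
draft: the two fields mirror struct<path, etag>, but the helper name
is_unchanged and the example path are hypothetical, and it assumes the
object store reports a current ETag that can be compared as a plain
string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    """Illustrative stand-in for struct<path, etag>; not the spec definition."""
    path: str  # URI of the referenced object
    etag: str  # object version identifier recorded when the row was written


def is_unchanged(ref: FileRef, current_etag: str) -> bool:
    """Hypothetical helper: True if the object still matches the recorded etag."""
    return ref.etag == current_etag


ref = FileRef(path="s3://media/images/cat.png", etag="abc123")
print(is_unchanged(ref, "abc123"))  # object not overwritten since the write
print(is_unchanged(ref, "def456"))  # object was overwritten; the read is stale
```

A bare string URI column carries only the path, so a reader has nothing
to compare against and cannot detect the overwrite in the second case.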

The core idea is to shift the responsibility of access control to the
Iceberg REST Catalog. Instead of granting compute engines direct bucket
access, the proposal introduces a new object-access endpoint. The catalog
brokers access by vending short-lived credentials or pre-signed URLs
strictly for the referenced objects (validated against a new
fileref.allowed-locations table property).
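
As a rough sketch of the validation step, the Python below checks a
referenced path against a list of allowed location prefixes before any
credential would be vended. Several things here are my assumptions for
illustration rather than details from the proposal: that the
fileref.allowed-locations value is a comma-separated list of URI
prefixes, the function names parse_allowed_locations and is_allowed, and
the example bucket paths. It also does not normalize paths (for example
".." segments), which a real implementation would need to handle.

```python
from typing import List


def parse_allowed_locations(value: str) -> List[str]:
    """Split a property value into prefixes (comma-separated format is assumed)."""
    return [part.strip() for part in value.split(",") if part.strip()]


def is_allowed(path: str, allowed_locations: List[str]) -> bool:
    """Hypothetical check: True if path lies under one of the allowed prefixes."""
    for location in allowed_locations:
        # Compare against "prefix/" so that "s3://media/images" does not
        # also match a sibling such as "s3://media/images-private/...".
        prefix = location if location.endswith("/") else location + "/"
        if path.startswith(prefix):
            return True
    return False


allowed = parse_allowed_locations("s3://media/images/, s3://media/video")
print(is_allowed("s3://media/images/cat.png", allowed))
print(is_allowed("s3://media/images-private/x.png", allowed))
```

In this sketch the catalog would only go on to vend a short-lived
credential or pre-signed URL when is_allowed returns True.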

You can read the full proposal draft here:
https://s.apache.org/iceberg-fileref

I would love to get your feedback on this approach.

Parquet Proposal:
https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?tab=t.0#heading=h.k8qyue4jj4rn

Best,
Talat Uyarer
