Hi everyone, I’d like to open a discussion on a new proposal to better support unstructured data in Iceberg.

As tables increasingly need to reference unstructured objects (images, video, ML artifacts, PDFs) that are too large to embed, the current fallback is to use bare string URI columns. This has a few structural problems: it bypasses catalog governance (requiring engines to hold broad bucket-level credentials), lacks cross-engine portability, and breaks read determinism if the underlying object is overwritten.

There is already an active proposal in the Parquet community to introduce a native File logical type for physical files. To address the problems above at the table-format layer, I've drafted a proposal for a FileRef type (struct<path, etag>) that is designed to layer directly on top of that work. While Parquet defines the physical columnar representation, Iceberg's FileRef handles the table-format concerns: governance, read determinism, snapshot isolation, and access brokering. A physical File column in Parquet will map 1:1 to Iceberg's logical FileRef, ensuring a unified standard from the storage layer up to the catalog.

The core idea is to shift the responsibility for access control to the Iceberg REST Catalog. Instead of granting compute engines direct bucket access, the proposal introduces a new object-access endpoint. The catalog brokers access by vending short-lived credentials or pre-signed URLs strictly for the referenced objects, which are validated against a new fileref.allowed-locations table property.

You can read the full proposal draft here: https://s.apache.org/iceberg-fileref

Parquet Proposal: https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?tab=t.0#heading=h.k8qyue4jj4rn

I would love to get your feedback on this approach.

Best,
Talat Uyarer
