[DISCUSS] FileFormat API proposal

Péter Váry Tue, 11 Feb 2025 08:47:02 -0800

Hi Team,

As mentioned earlier in our Community Sync, I am exploring the
possibility of defining a FileFormat API for accessing different file
formats. I have put together a proposal based on my findings.


-------------------
Iceberg currently supports three file formats: Avro, Parquet, and ORC.
With the introduction of the Iceberg V3 specification, many new features
are being added to Iceberg. Some of these features, such as new column
types and default values, require changes at the file format level. These
changes are added by individual developers, each focusing on different
file formats. As a result, not all of the features are available for every
supported file format.
There are also emerging file formats, like Vortex [1] or Lance [2], which
could provide better alternatives for certain use cases, such as random
access to data or storing ML models, either through specialization or by
applying newer research results.
-------------------

Please check the detailed proposal [3] and the Google document [4], and
comment there or reply on the dev list if you have any suggestions.

Thanks,
Peter

[1] - https://github.com/spiraldb/vortex
[2] - https://lancedb.github.io/lance/
[3] - https://github.com/apache/iceberg/issues/12225
[4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
