Re: [DISCUSS] FileFormat API proposal

Péter Váry Tue, 18 Feb 2025 01:17:08 -0800

Accidentally force-pushed :(
The new links are here:

   - https://github.com/apache/iceberg/pull/12298/commits/583cccb6e036323ee74a74bf3b06a40bf16f8982
      - The API Interface classes
   - https://github.com/apache/iceberg/pull/12298/commits/217e68caa61667032da3d710401078bb50b0a99f
      - Moving the Parquet/Avro/ORC readers and writers to implement these
        interfaces
   - https://github.com/apache/iceberg/pull/12298/commits/7989416718657871760ae010dcb46a92904c1768
      - Moving the implementation of the generic readers/writers with the new
        interfaces
   - https://github.com/apache/iceberg/pull/12298/commits/6595ccc381d4931bcf04bbdb1db8982c3f450bb4
      - Arrow reader implementation with the new interfaces
   - https://github.com/apache/iceberg/pull/12298/commits/ce9b82aa55bdfddbb4ba3b1230f9f10342adec6d
      - Spark reader/writer implementation with the new interfaces
   - https://github.com/apache/iceberg/pull/12298/commits/313c2d59b04db390be09172356d3f5359e6f6d6e
      - Flink reader/writer implementation with the new interfaces



On Tue, Feb 18, 2025 at 10:08, Péter Váry <[email protected]>
wrote:

> Hi Renjie,
>
> Based on your feedback, I have created a PR which splits the
> different logical parts into separate commits:
> https://github.com/apache/iceberg/pull/12298
> The following parts are separated:
>
>    - https://github.com/apache/iceberg/pull/12298/commits/1ad230f67df014b424c3547603831f5e637b96d0
>       - The API Interface classes
>    - https://github.com/apache/iceberg/pull/12298/commits/6fa135927676fd080d8322d7d09cf2b86f54de36
>       - Moving the Parquet/Avro/ORC readers and writers to implement these
>         interfaces
>    - https://github.com/apache/iceberg/pull/12298/commits/b6ab3d059732b7c898dd2a385f0cfa8a7956e999
>       - Moving the implementation of the generic readers/writers with the new
>         interfaces
>    - https://github.com/apache/iceberg/pull/12298/commits/aba830a86f535b2d1363b350d5f8b8622b608f1a
>       - Arrow reader implementation with the new interfaces
>    - https://github.com/apache/iceberg/pull/12298/commits/21179b8d0f7d1f8db3d9ea532d8cc776533b3fdf
>       - Spark reader/writer implementation with the new interfaces
>    - https://github.com/apache/iceberg/pull/12298/commits/907089c15fb497879ac879ff1d9227fc684d356d
>       - Flink reader/writer implementation with the new interfaces
>
> Thanks,
> Peter
>
>
>
> On Fri, Feb 14, 2025 at 11:30, Péter Váry <[email protected]>
> wrote:
>
>> Hi Renjie,
>> Here is the WIP PR for the readers:
>> https://github.com/apache/iceberg/pull/12069
>> Here is the WIP PR for the writers:
>> https://github.com/apache/iceberg/pull/12164
>>
>> If you want to concentrate on the proposed new API, maybe this is the
>> best place to start:
>> https://github.com/apache/iceberg/compare/main...pvary:iceberg:file_format_api_minimal_few_class
>> Thanks,
>> Peter
>>
>> On Fri, Feb 14, 2025 at 11:15, Renjie Liu <[email protected]>
>> wrote:
>>
>>> Hi, Peter:
>>>
>>> Thanks for raising this; the proposal sounds quite interesting to
>>> me.
>>>
>>> I've reviewed the doc, but it still seems too abstract to follow. Would
>>> you mind submitting a PR so that it is clearer what has changed?
>>>
>>> On Wed, Feb 12, 2025 at 12:46 AM Péter Váry <[email protected]>
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> As mentioned earlier in our Community Sync, I am exploring the
>>>> possibility of defining a FileFormat API for accessing different file
>>>> formats. I have put together a proposal based on my findings.
>>>>
>>>> -------------------
>>>> Iceberg currently supports 3 different file formats: Avro, Parquet, and
>>>> ORC. The Iceberg V3 specification adds many new features to Iceberg.
>>>> Some of these features, like new column types and default values,
>>>> require changes at the file format level. These changes are added by
>>>> individual developers, each with a different focus across the file
>>>> formats. As a result, not all of the features are available for every
>>>> supported file format.
>>>> There are also emerging file formats like Vortex [1] or Lance [2] which,
>>>> either by specialization or by applying newer research results, could
>>>> provide better alternatives for certain use cases like random access to
>>>> data or storing ML models.
>>>> -------------------
>>>>
>>>> Please check the detailed proposal [3] and the Google document [4], and
>>>> comment there or reply on the dev list if you have any suggestions.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> [1] - https://github.com/spiraldb/vortex
>>>> [2] - https://lancedb.github.io/lance/
>>>> [3] - https://github.com/apache/iceberg/issues/12225
>>>> [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>>>
>>>>
