Hi everyone,

The proposed API part is reviewed and ready to go. See: https://github.com/apache/iceberg/pull/12774

Thanks to everyone who reviewed it already!
Many of you wanted to review, but I know everyone is under time constraints. I still very much would like to hear your voices, so I will not merge the PR this week. Please review it if you can.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Apr 16, 2025 at 7:02):

> Hi Renjie,
> The first one for the proposed new API is here:
> https://github.com/apache/iceberg/pull/12774
> Thanks,
> Peter
>
> On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote:
>
>> Hi, Peter:
>>
>> Thanks for the effort. I totally agree with splitting them into smaller PRs to move forward.
>>
>> I'm quite interested in this topic, so please ping me on those split PRs and I'll help to review.
>>
>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi Peter
>>>
>>> Awesome! Thank you so much!
>>> I will do a new pass.
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >
>>> > Hi JB,
>>> >
>>> > I have separated out the proposed interfaces into a new PR: https://github.com/apache/iceberg/pull/12774.
>>> > Reviewers can check that out if they are only interested in what the new API would look like.
>>> >
>>> > Thanks,
>>> > Peter
>>> >
>>> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025 at 18:25):
>>> >>
>>> >> Hi Peter
>>> >>
>>> >> Thanks for the ping about the PR.
>>> >>
>>> >> Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, AppenderBuilder, DataWriterBuilder, ...)
>>> >> - one for each file format provider (Parquet, Avro, ORC)
>>> >>
>>> >> Thoughts? I can help with the split if needed.
>>> >>
>>> >> Regards
>>> >> JB
>>> >>
>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >
>>> >> > Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
>>> >> >
>>> >> > To recap, here is what the PR does after the review rounds:
>>> >> >
>>> >> > Created 3 interface classes which are implemented by the file formats:
>>> >> > - ReadBuilder - Builder for reading data from data files
>>> >> > - AppenderBuilder - Builder for writing data to data files
>>> >> > - ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
>>> >> >
>>> >> > Updated the Parquet, Avro, and ORC implementations for these interfaces, and deprecated the old reader/writer APIs.
>>> >> >
>>> >> > Created interface classes which will be used by the actual readers/writers of the data files:
>>> >> > - AppenderBuilder - Builder for writing a file
>>> >> > - DataWriterBuilder - Builder for generating a data file
>>> >> > - PositionDeleteWriterBuilder - Builder for generating a position delete file
>>> >> > - EqualityDeleteWriterBuilder - Builder for generating an equality delete file
>>> >> > - No ReadBuilder here - the file format reader builder is reused
>>> >> >
>>> >> > Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder.
>>> >> >
>>> >> > Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>>> >> >
>>> >> > Created the appropriate ObjectModels:
>>> >> > - GenericObjectModels - for reading and writing Iceberg Records
>>> >> > - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
>>> >> > - FlinkObjectModels - for reading and writing Flink RowData objects
>>> >> > - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>>> >> >
>>> >> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
>>> >> >
>>> >> > Kept the testing code intact to ensure that the new API/code does not break anything.
>>> >> >
>>> >> > The original change was not small, and it grew substantially during the review rounds. So if you have questions, or if I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
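>>> >> >
>>> >> > For illustration, requesting a reader from the ObjectModelRegistry could look roughly like the sketch below. Treat it as a sketch only - the lookup and builder method names and signatures here are assumptions, not the final API:
>>> >> >
>>> >> >   // Hypothetical sketch - method names and signatures are illustrative,
>>> >> >   // assuming inputFile, projectionSchema, start, length, and
>>> >> >   // filterExpression are already available to the caller
>>> >> >   CloseableIterable<Record> records =
>>> >> >       ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile)
>>> >> >           .project(projectionSchema)  // schema to read
>>> >> >           .split(start, length)       // optional split boundaries in the file
>>> >> >           .filter(filterExpression)   // best-effort filter push-down
>>> >> >           .build();
>>> >> >
>>> >> > Thanks,
>>> >> > Peter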
>>> >> >
>>> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025 at 14:54):
>>> >> >>
>>> >> >> Hi everyone,
>>> >> >>
>>> >> >> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
>>> >> >>
>>> >> >> I would like to merge this only after the 1.9.0 release, so that we have more time to find and solve any issues before this goes into a release for the users. For this I have updated the deprecation comments accordingly.
>>> >> >>
>>> >> >> I would like to ask you to review the PR, so we can iron out any requested changes and be ready for the merge as soon as possible after the 1.9.0 release.
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Peter
>>> >> >>
>>> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025 at 14:32):
>>> >> >>>
>>> >> >>> Hi Renjie,
>>> >> >>>
>>> >> >>> > 1. File format filters
>>> >> >>> >
>>> >> >>> > Do the filters include filter expressions from both the user query and the delete filter?
>>> >> >>>
>>> >> >>> The current discussion is about the filters from the user query.
>>> >> >>>
>>> >> >>> About the delete filter:
>>> >> >>> Based on the suggestions on the PR, I have moved the delete filter out of the main API. I created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to the Parquet vectorized readers in Spark, as this is the only place where we currently implement this feature.
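>>> >> >>>
>>> >> >>> For context, the hook is roughly shaped like this (a sketch only - the exact method name and type parameter may differ in the PR):
>>> >> >>>
>>> >> >>>   // Hypothetical sketch of the SupportsDeleteFilter hook
>>> >> >>>   public interface SupportsDeleteFilter<F> {
>>> >> >>>     // Implemented by reader builders that can apply a delete
>>> >> >>>     // filter while scanning the data file
>>> >> >>>     void deleteFilter(F filter);
>>> >> >>>   }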
>>> >> >>>
>>> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025 at 14:11):
>>> >> >>>>
>>> >> >>>> Hi, Peter:
>>> >> >>>>
>>> >> >>>> Thanks for the effort on this.
>>> >> >>>>
>>> >> >>>> 1. File format filters
>>> >> >>>>
>>> >> >>>> Do the filters include filter expressions from both the user query and the delete filter?
>>> >> >>>>
>>> >> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
>>> >> >>>>
>>> >> >>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to push it to the file reader as well; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
>>> >> >>>>
>>> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>> >> >>>>
>>> >> >>>> 2. Batch size configuration
>>> >> >>>>
>>> >> >>>> I'm leaning toward option 2.
>>> >> >>>>
>>> >> >>>> 3. Spark configuration
>>> >> >>>>
>>> >> >>>> I'm leaning towards using different configuration objects.
>>> >> >>>>
>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >>>>>
>>> >> >>>>> Hi Team,
>>> >> >>>>>
>>> >> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
>>> >> >>>>> I have addressed most of the comments, but a few questions still remain which might merit a wider audience:
>>> >> >>>>>
>>> >> >>>>> 1. We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied as best effort by the file format readers; some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers would either reject part of the filters or apply them fully.
>>> >> >>>>> 2. Batch sizes are currently parameters of the reader builders, which can be set for non-vectorized readers too, and that could be confusing.
>>> >> >>>>> 3. Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
>>> >> >>>>>
>>> >> >>>>> I would be interested in hearing your thoughts about these topics.
>>> >> >>>>>
>>> >> >>>>> My current take:
>>> >> >>>>>
>>> >> >>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
>>> >> >>>>>
>>> >> >>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions (see the sketch below):
>>> >> >>>>> - Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class
>>> >> >>>>> - We could put this into the reader configuration property set - this could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers
>>> >> >>>>> - We could keep things as they are now - I would choose this one, but I don't have a strong opinion here
>>> >> >>>>>
>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.
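>>> >> >>>>>
>>> >> >>>>> To make the second batch size option concrete, here is a rough sketch of the two styles; the method and property names are made up for illustration only:
>>> >> >>>>>
>>> >> >>>>>   // As it is now: a dedicated builder method,
>>> >> >>>>>   // also visible to non-vectorized readers
>>> >> >>>>>   readBuilder.recordsPerBatch(10000);
>>> >> >>>>>
>>> >> >>>>>   // Option 2: an entry in the general reader configuration
>>> >> >>>>>   // property set ("read.batch-size" is a hypothetical key)
>>> >> >>>>>   readBuilder.set("read.batch-size", "10000");
>>> >> >>>>>
>>> >> >>>>> Thanks,
>>> >> >>>>> Peter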
>>> >> >>>>>
>>> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025 at 16:31):
>>> >> >>>>>>
>>> >> >>>>>> Hi Peter
>>> >> >>>>>>
>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>> >> >>>>>>
>>> >> >>>>>> Regards
>>> >> >>>>>> JB
>>> >> >>>>>>
>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >>>>>> >
>>> >> >>>>>> > Hi Team,
>>> >> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
>>> >> >>>>>> > Thanks,
>>> >> >>>>>> > Peter
>>> >> >>>>>> >
>>> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025 at 17:50):
>>> >> >>>>>> >>
>>> >> >>>>>> >> Hi everyone,
>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
>>> >> >>>>>> >> I have updated the code to address most of it. Please check whether you agree with the general approach.
>>> >> >>>>>> >> If there is a consensus about the general approach, I could separate out the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>> >> >>>>>> >> Thanks,
>>> >> >>>>>> >> Peter
>>> >> >>>>>> >>
>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025 at 14:14):
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Hi Peter
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Sorry for the late reply on this.
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
>>> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all together.
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Thoughts?
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Regards
>>> >> >>>>>> >>> JB
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > Hi Team,
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > As mentioned earlier at our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > -------------------
>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, and ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result, not all of the features are available for every supported file format.
>>> >> >>>>>> >>> > There are also emerging file formats, like Vortex [1] or Lance [2], which, either by specialization or by applying newer research results, could provide better alternatives for certain use-cases such as random access to data or storing ML models.
>>> >> >>>>>> >>> > -------------------
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google document [4], and comment there or reply on the dev list if you have any suggestions.
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > Thanks,
>>> >> >>>>>> >>> > Peter
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>> >> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds