Hi everyone,

The proposed API part is reviewed and ready to go. See: https://github.com/apache/iceberg/pull/12774

Thanks to everyone who reviewed it already!
Many of you wanted to review, but I know everyone is under time constraints. I still very much would like to hear your voices, so I will not merge the PR this week. Please review it if you can.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Apr 16, 2025 at 7:02):

> Hi Renjie,
> The first one for the proposed new API is here:
> https://github.com/apache/iceberg/pull/12774
> Thanks,
> Peter
>
> On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote:
>
>> Hi, Peter:
>>
>> Thanks for the effort. I totally agree with splitting them into smaller PRs to move forward.
>>
>> I'm quite interested in this topic, so please ping me on those split PRs and I'll help to review.
>>
>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi Peter
>>>
>>> Awesome! Thank you so much!
>>> I will do a new pass.
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >
>>> > Hi JB,
>>> >
>>> > I have separated out the proposed interfaces into a new PR: https://github.com/apache/iceberg/pull/12774.
>>> > Reviewers can check that out if they are only interested in what the new API would look like.
>>> >
>>> > Thanks,
>>> > Peter
>>> >
>>> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025 at 18:25):
>>> >>
>>> >> Hi Peter
>>> >>
>>> >> Thanks for the ping about the PR.
>>> >>
>>> >> Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, AppenderBuilder, DataWriterBuilder, ...)
>>> >> - one for each file format provider (Parquet, Avro, ORC)
>>> >>
>>> >> Thoughts? I can help with the split if needed.
>>> >>
>>> >> Regards
>>> >> JB
>>> >>
>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >
>>> >> > Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
>>> >> >
>>> >> > To recap, here is what the PR does after the review rounds:
>>> >> >
>>> >> > Created 3 interface classes which are implemented by the file formats:
>>> >> > - ReadBuilder - Builder for reading data from data files
>>> >> > - AppenderBuilder - Builder for writing data to data files
>>> >> > - ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
>>> >> >
>>> >> > Updated the Parquet, Avro, and ORC implementations for these interfaces, and deprecated the old reader/writer APIs.
>>> >> >
>>> >> > Created interface classes which will be used by the actual readers/writers of the data files:
>>> >> > - AppenderBuilder - Builder for writing a file
>>> >> > - DataWriterBuilder - Builder for generating a data file
>>> >> > - PositionDeleteWriterBuilder - Builder for generating a position delete file
>>> >> > - EqualityDeleteWriterBuilder - Builder for generating an equality delete file
>>> >> > - No ReadBuilder here - the file format reader builder is reused
>>> >> >
>>> >> > Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder.
>>> >> >
>>> >> > Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>>> >> >
>>> >> > Created the appropriate ObjectModels:
>>> >> > - GenericObjectModels - for reading and writing Iceberg Records
>>> >> > - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
>>> >> > - FlinkObjectModels - for reading and writing Flink RowData objects
>>> >> > - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>>> >> >
>>> >> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
>>> >> >
>>> >> > Kept the testing code intact to ensure that the new API/code does not break anything.
>>> >> >
>>> >> > The original change was not small, and it grew substantially during the review rounds. So if you have questions, or if I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
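>>> >> >
>>> >> > For illustration, requesting a reader from the ObjectModelRegistry could look roughly like the sketch below. Treat it as a sketch only - the lookup and builder method names and signatures here are assumptions, not the final API:
>>> >> >
>>> >> >   // Hypothetical sketch - method names and signatures are illustrative,
>>> >> >   // assuming inputFile, projectionSchema, start, length, and
>>> >> >   // filterExpression are already available to the caller
>>> >> >   CloseableIterable<Record> records =
>>> >> >       ObjectModelRegistry.readBuilder(FileFormat.PARQUET, "generic", inputFile)
>>> >> >           .project(projectionSchema)  // schema to read
>>> >> >           .split(start, length)       // optional split boundaries in the file
>>> >> >           .filter(filterExpression)   // best-effort filter push-down
>>> >> >           .build();
>>> >> >
>>> >> > Thanks,
>>> >> > Peter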
>>> >> >
>>> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025 at 14:54):
>>> >> >>
>>> >> >> Hi everyone,
>>> >> >>
>>> >> >> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
>>> >> >>
>>> >> >> I would like to merge this only after the 1.9.0 release, so that we have more time to find and solve any issues before this goes into a release for the users. For this I have updated the deprecation comments accordingly.
>>> >> >>
>>> >> >> I would like to ask you to review the PR, so we can iron out any requested changes and be ready for the merge as soon as possible after the 1.9.0 release.
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Peter
>>> >> >>
>>> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025 at 14:32):
>>> >> >>>
>>> >> >>> Hi Renjie,
>>> >> >>>
>>> >> >>> > 1. File format filters
>>> >> >>> >
>>> >> >>> > Do the filters include filter expressions from both the user query and the delete filter?
>>> >> >>>
>>> >> >>> The current discussion is about the filters from the user query.
>>> >> >>>
>>> >> >>> About the delete filter:
>>> >> >>> Based on the suggestions on the PR, I have moved the delete filter out of the main API. I created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to the Parquet vectorized readers in Spark, as this is the only place where we currently implement this feature.
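>>> >> >>>
>>> >> >>> For context, the hook is roughly shaped like this (a sketch only - the exact method name and type parameter may differ in the PR):
>>> >> >>>
>>> >> >>>   // Hypothetical sketch of the SupportsDeleteFilter hook
>>> >> >>>   public interface SupportsDeleteFilter<F> {
>>> >> >>>     // Implemented by reader builders that can apply a delete
>>> >> >>>     // filter while scanning the data file
>>> >> >>>     void deleteFilter(F filter);
>>> >> >>>   }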
>>> >> >>>
>>> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025 at 14:11):
>>> >> >>>>
>>> >> >>>> Hi, Peter:
>>> >> >>>>
>>> >> >>>> Thanks for the effort on this.
>>> >> >>>>
>>> >> >>>> 1. File format filters
>>> >> >>>>
>>> >> >>>> Do the filters include filter expressions from both the user query and the delete filter?
>>> >> >>>>
>>> >> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
>>> >> >>>>
>>> >> >>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to push it to the file reader as well; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
>>> >> >>>>
>>> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>> >> >>>>
>>> >> >>>> 2. Batch size configuration
>>> >> >>>>
>>> >> >>>> I'm leaning toward option 2.
>>> >> >>>>
>>> >> >>>> 3. Spark configuration
>>> >> >>>>
>>> >> >>>> I'm leaning towards using different configuration objects.
>>> >> >>>>
>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >>>>>
>>> >> >>>>> Hi Team,
>>> >> >>>>>
>>> >> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
>>> >> >>>>> I have addressed most of the comments, but a few questions still remain which might merit a wider audience:
>>> >> >>>>>
>>> >> >>>>> 1. We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied as best effort by the file format readers; some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers would either reject part of the filters or apply them fully.
>>> >> >>>>> 2. Batch sizes are currently parameters of the reader builders, which can be set for non-vectorized readers too, and that could be confusing.
>>> >> >>>>> 3. Currently the Spark batch reader uses different configuration objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
>>> >> >>>>>
>>> >> >>>>> I would be interested in hearing your thoughts about these topics.
>>> >> >>>>>
>>> >> >>>>> My current take:
>>> >> >>>>>
>>> >> >>>>> File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
>>> >> >>>>>
>>> >> >>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions (see the sketch below):
>>> >> >>>>> - Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class
>>> >> >>>>> - We could put this into the reader configuration property set - this could work, but it would "hide" a configuration option which is valid for both the Parquet and ORC readers
>>> >> >>>>> - We could keep things as they are now - I would choose this one, but I don't have a strong opinion here
>>> >> >>>>>
>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.
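>>> >> >>>>>
>>> >> >>>>> To make the second batch size option concrete, here is a rough sketch of the two styles; the method and property names are made up for illustration only:
>>> >> >>>>>
>>> >> >>>>>   // As it is now: a dedicated builder method,
>>> >> >>>>>   // also visible to non-vectorized readers
>>> >> >>>>>   readBuilder.recordsPerBatch(10000);
>>> >> >>>>>
>>> >> >>>>>   // Option 2: an entry in the general reader configuration
>>> >> >>>>>   // property set ("read.batch-size" is a hypothetical key)
>>> >> >>>>>   readBuilder.set("read.batch-size", "10000");
>>> >> >>>>>
>>> >> >>>>> Thanks,
>>> >> >>>>> Peter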
>>> >> >>>>>
>>> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025 at 16:31):
>>> >> >>>>>>
>>> >> >>>>>> Hi Peter
>>> >> >>>>>>
>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>> >> >>>>>>
>>> >> >>>>>> Regards
>>> >> >>>>>> JB
>>> >> >>>>>>
>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >>>>>> >
>>> >> >>>>>> > Hi Team,
>>> >> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
>>> >> >>>>>> > Thanks,
>>> >> >>>>>> > Peter
>>> >> >>>>>> >
>>> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025 at 17:50):
>>> >> >>>>>> >>
>>> >> >>>>>> >> Hi everyone,
>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
>>> >> >>>>>> >> I have updated the code to address most of it. Please check whether you agree with the general approach.
>>> >> >>>>>> >> If there is a consensus about the general approach, I could separate out the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>> >> >>>>>> >> Thanks,
>>> >> >>>>>> >> Peter
>>> >> >>>>>> >>
>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Feb 20, 2025 at 14:14):
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Hi Peter
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Sorry for the late reply on this.
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
>>> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all together.
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Thoughts?
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> Regards
>>> >> >>>>>> >>> JB
>>> >> >>>>>> >>>
>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > Hi Team,
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > As mentioned earlier at our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > -------------------
>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, and ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result, not all of the features are available for every supported file format.
>>> >> >>>>>> >>> > There are also emerging file formats, like Vortex [1] or Lance [2], which, either by specialization or by applying newer research results, could provide better alternatives for certain use-cases such as random access to data or storing ML models.
>>> >> >>>>>> >>> > -------------------
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google document [4], and comment there or reply on the dev list if you have any suggestions.
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > Thanks,
>>> >> >>>>>> >>> > Peter
>>> >> >>>>>> >>> >
>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>> >> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds