Peter, thanks for summarizing the four options. Both 0 and 1 seem good to me, as they are explicit and make it easier to deprecate and remove position deletes in the future. Maybe option 0 is a tiny bit better, as it is similar to the existing FileWriterFactory API.
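For context, option 0 would keep two explicit entry points on the format model. A minimal sketch, generifying the signatures from Peter's summary below over the engine row type D (illustration only, not the final API):

    interface FormatModel<D> {
      // Writer builder for data files and equality delete files of type D
      WriteBuilder<D> writeBuilder(OutputFile outputFile);

      // Writer builder for V2 position delete files; this one can be
      // deprecated and removed together with position delete write support
      WriteBuilder<PositionDelete<D>> positionDeleteWriteBuilder(OutputFile outputFile);
    }

This keeps the position delete path behind a single method that can be deprecated on its own.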
I will leave PR-related comments in the PR directly.

On Mon, Sep 15, 2025 at 8:38 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Thanks for the feedback @Russell and @Renjie!

Updated the PR accordingly.
Also removed the possibility to set the row schema for the position delete writer. We will not need that after the PDWR deprecation.

You can see one possible implementation in https://github.com/apache/iceberg/pull/12298 - we can discuss that separately. I just made sure that the new API is able to serve all of the current needs.

@Ryan: What are your thoughts?

Are we at a stage where we can vote on the current API?

Thanks,
Peter

Renjie Liu <liurenjie2...@gmail.com> wrote (on Mon, Sep 15, 2025, 12:08):

I would also vote for option 0. This API has a clean separation and makes refactoring easier; e.g., when we completely deprecate V2 tables, we could mark the *positionDeleteWriteBuilder* method as deprecated, and it would be easier to remove its usage.

On Fri, Sep 12, 2025 at 11:24 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Now that I fully understand the situation, I think option 0 as you've written it is probably the best thing to do, as long as PositionDelete is a class. I think with hindsight it probably shouldn't have been a class and always been an interface, so that our internal code could produce rows which implement PositionDelete rather than PositionDeletes that wrap rows.

On Fri, Sep 12, 2025 at 8:02 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Let me summarize the state a bit:

The FileFormat interface needs to expose two distinct methods:

- WriteBuilder<InternalRow>
- WriteBuilder<PositionDelete<InternalRow>>
  - After the PDWR deprecation this will be WriteBuilder<PositionDelete>
  - After the V2 deprecation this will not be needed anymore

Based on the file format methods, the Registry must support four builder types:

- WriteBuilder<InternalRow>
- DataWriteBuilder<InternalRow>
- EqualityDeleteWriteBuilder<InternalRow>
- PositionDeleteWriteBuilder<InternalRow>

*API Design Considerations*
There is an argument that the two WriteBuilder methods provided by FileFormat are essentially the same, differing only in the writerFunction. While this is technically correct for current implementations, I believe the API should clearly distinguish between the two writer types to highlight the differences.

*Discussed Approaches*

*0. Two Explicit Methods on FormatModel* (removed based on previous comments, but I personally still prefer this)

    WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
    WriteBuilder<PositionDelete<InternalRow>> positionDeleteWriteBuilder(OutputFile outputFile);

Pros: Clear separation of responsibilities

*1. One Builder + One Converter*

    WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
    Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);

Pros: Keeps the interface compact
Cons: Requires additional documentation and an understanding of why the conversion logic is needed
*2. Single Method with Javadoc Clarification* (most similar to the current approach)

    WriteBuilder writeBuilder(OutputFile outputFile);

Pros: Minimalistic
Cons: Least explicit; relies entirely on documentation

*2/b. Single Builder with Type Parameter* (based on Russell's suggestion)

    WriteBuilder writeBuilder(OutputFile outputFile);
    // Usage: builder.build(Class<D> inputType)

Pros: Flexible
Cons: Relies on documentation to clarify the available input types

*Bonus*
Options 0 and 1 make it easier to phase out PositionDelete filtering once V2 tables are deprecated.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Thu, Sep 11, 2025, 18:36):

> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.

With the *second approach*, the WriteBuilder doesn't need to handle PositionDelete objects directly. The conversion layer takes care of that, so the WriteBuilder only needs to work with InternalRow.

With the *first approach*, we shift that responsibility to the WriteBuilder, which then has to support both InternalRow and PositionDelete<InternalRow>.

In both cases, the FormatModelRegistry API will still expose the more concrete types (PositionDelete / InternalRow). However, under the *second approach*, the lower-level API only needs to handle InternalRow, simplifying its interface.

Thanks,
Peter

Russell Spitzer <russell.spit...@gmail.com> wrote (on Thu, Sep 11, 2025, 17:12):

Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.

On Thu, Sep 11, 2025 at 5:35 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Thanks, Russell, for taking a look at this!

We need to expose four methods on the user-facing API (FormatModelRegistry):

1. *writeBuilder* - for writing arbitrary files without Iceberg metadata. In the Iceberg codebase, this is exposed via FlinkAppenderFactory and the GenericAppenderFactory for creating FileAppender<RowData> and FileAppender<Record> only.
2. *dataWriteBuilder* - for creating and collecting metadata for Iceberg DataFiles.
3. *equalityDeleteWriteBuilder* - for creating and collecting metadata for Iceberg EqualityDeleteFiles.
4. *positionDeleteWriteBuilder* - for creating and collecting metadata for Iceberg PositionDeleteFiles.

We'd like to implement all four using a single WriteBuilder created by the FormatModels.
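As a sketch, the user-facing side could then look something like this (signatures assumed for illustration, patterned after the FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile) usage further down this thread; the final API may differ):

    // Hypothetical shape of the four FormatModelRegistry entry points;
    // D is the engine row type (e.g. InternalRow for Spark)
    <D> WriteBuilder<D> writeBuilder(FileFormat format, Class<D> type, OutputFile file);
    <D> DataWriteBuilder<D> dataWriteBuilder(FileFormat format, Class<D> type, OutputFile file);
    <D> EqualityDeleteWriteBuilder<D> equalityDeleteWriteBuilder(FileFormat format, Class<D> type, OutputFile file);
    <D> PositionDeleteWriteBuilder<D> positionDeleteWriteBuilder(FileFormat format, Class<D> type, OutputFile file);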
Your suggestion is a good one - it helps formalize the requirements for the build method and also surfaces an important design question:

*Who should be responsible for handling the differences between normal rows (InternalRow) and position deletes (PositionDelete<InternalRow>)?*

- Should we have a more complex WriteBuilder class that can create both DataFileAppender and PositionDeleteAppender?
- Or should we push this responsibility to the engine-specific code, where we already have some logic (e.g., pathTransformFunc) needed by each engine to create the PositionDeleteAppender?

Thanks,
Peter

Russell Spitzer <russell.spit...@gmail.com> wrote (on Thu, Sep 11, 2025, 0:11):

I'm a little confused here, I think Ryan mentioned this in the comment here https://github.com/apache/iceberg/pull/12774/files#r2254967177

From my understanding there are two options?

1) We are either producing FormatModels that take a generic row type D and produce writers that all take D and write files.

2) We are creating IcebergModel-specific writers that take DataFile, PositionDeleteFile, EqualityDeleteFile etc ... and write files

The PositionDelete Converter issue seems to stem from attempting to do both model 1 (being very generic) and 2, wanting special code to deal with PositionDeleteFile<R> objects.

It looks like the code in #12774 is mostly doing model 1, but we are trying to add in a specific converter for 2?

Maybe I'm totally lost here, but I was assuming we would do something a little scala-y like

    public <T> FileAppender<T> build(Class<T> type) {
      if (type == DataFile.class) return (FileAppender<T>) new DataFileAppender();
      if (type == DeleteFile.class) return (FileAppender<T>) new DeleteFileAppender();
      // ...
    }

So that we only register a single signature, and if a writer-specific implementation needs to do something special, it can? I'm trying to catch back up to speed on this PR, so it may help to do a quick summary of the current state and intent. (At least for me)

On Tue, Sep 9, 2025 at 3:42 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Renjie,
Thanks for taking a look!

Let me clarify a few points:
- The converter API is only required for writing position delete files for V2 tables
- Currently, there are no plans to support vectorized writing via the Java API
- Even if we decide to support vectorized writes, I don't think we would like to implement it for positional deletes, which are deprecated in the new spec.
- Also, once the positional deletes - which contain the deleted rows - are deprecated (as planned), the conversion of the position deletes with only file name and position would be trivial, even for the vectorized writes.

So from my perspective, the converter method exists purely for backward compatibility, and we intend to remove it as soon as possible. Sacrificing good practices for the sake of a deprecated feature doesn't seem worthwhile to me.

Thanks,
Peter

Renjie Liu <liurenjie2...@gmail.com> wrote (on Mon, Sep 8, 2025, 12:34):

Hi, Peter:

I would vote for the first approach. In spite of the compromises described, the API is still cleaner.
Also, I think there are some problems with the converter API. For example, for vectorized implementations such as Comet, which accept columnar batches rather than rows, the converter method would make things more complicated.

On Sat, Aug 30, 2025 at 2:49 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

I've initiated a discussion thread regarding the deprecation of Position Deletes containing row data. You can follow it here: https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s

We can proceed with the discussion about the native reader/writer deprecation once we have decided on the final API, as the chosen design may influence our approach.

Since then, one more question has come up - hopefully the last: *How should we handle Position Delete Writers?*
The File Format API should return builders for either rows or PositionDelete objects. Currently the method `WriteBuilder.createWriterFunc(Function<MessageType, ParquetValueWriter<?>>)` defines the accepted input parameters for the writer. Users are responsible for ensuring that the writer function and the return type of `WriteBuilder.build()` are compatible. In the new API, we no longer expose writer functions. We still expose FileContent, since writer configurations vary by content type, but we don't expose the types.

There are two proposals for handling types for the WriteBuilders:

1. *Implicit Type Definition via FileContent* - the builder parameter for FileContent would implicitly define the input type for the writer returned by build(), or
2. *Engine level conversion* - Engines would convert PositionDelete objects to their native types.

In code:

- In the 1st proposal, FormatModel.writeBuilder(OutputFile outputFile) can return anything:

    WriteBuilder builder = FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
    FileAppender<InternalRow> appender = builder
        .schema(table.schema())
        .content(FileContent.DATA)
        ....
        .build();

    // Exposed, but FormatModelRegistry.positionDeleteWriteBuilder should be used instead
    WriteBuilder builder = FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
    FileAppender<PositionDelete<InternalRow>> appender = builder
        .schema(table.schema())
        .content(FileContent.POSITION_DELETES)
        ....
        .build();

- In the 2nd proposal, the FormatModel needs another method:

    Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);

example implementation:

    return delete -> {
      deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
      deleteRecord.update(1, delete.pos());
      deleteRecord.update(2, delete.row());
      return deleteRecord;
    };

    // Content is only used for writer property configuration
    WriteBuilder<InternalRow> builder = sparkFormatModel.writeBuilder(outputFile);
    FileAppender<InternalRow> appender = builder
        .schema(table.schema())
        .content(FileContent.DATA)
        ....
        .build();
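In the converter example, deleteRecord is presumably a reusable buffer row. A more self-contained sketch of such a converter for Spark (illustration only, assuming a GenericInternalRow buffer with the three position delete fields; not the final code):

    import java.util.function.Function;
    import org.apache.iceberg.deletes.PositionDelete;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
    import org.apache.spark.unsafe.types.UTF8String;

    // Converts PositionDelete<InternalRow> into an InternalRow laid out as
    // (file_path, pos, row), matching the position delete schema.
    // The buffer row is reused across calls, like in the example above.
    static Function<PositionDelete<InternalRow>, InternalRow> positionDeleteConverter() {
      GenericInternalRow deleteRecord = new GenericInternalRow(3);
      return delete -> {
        deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
        deleteRecord.update(1, delete.pos());
        deleteRecord.update(2, delete.row());
        return deleteRecord;
      };
    }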
Drawbacks:

- Proposal 1:
  - Type checking for the FileAppenders occurs only at runtime, so user errors surface late.
  - The File Format specification must clearly specify which builder type corresponds to which file content parameter; generics would offer better clarity.
  - Inconsistent patterns between WriteBuilder and ReadBuilder, as the latter can define output types via generics.
- Proposal 2:
  - Requires FormatModels to implement a converter method to transform PositionDelete<InternalRow> into InternalRow.

Since we deprecated writing position delete files in the V3 spec, this extra method in the 2nd proposal will be deprecated too. As a result, in the long run, we will have a nice, clean API.
OTOH, if we accept the compromise described in the 1st proposal, the consequences of our decision will remain, even when the deprecated functions are removed.

Looking forward to your thoughts.
Thanks, Peter

On Thu, Aug 14, 2025, 14:12 Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Team,

During yesterday's community sync, we discussed the current state of the File Format API proposal and identified two key questions that require input from the broader community:

*1. Dropping support for Position Delete files with Row Data*

The current Iceberg V2 spec [1] defines two types of position delete files:

- Files that store only the file name and row position.
- Files that also store the deleted row data.

Although this feature is defined in the spec and some tests exist in the Iceberg codebase, we're not aware of any actual implementation using the second type (with row data). Supporting V2 table writing via the new File Format API would be simpler if we dropped support for this feature.
If you know of any use case or reason to retain support for position deletes with row data, please let us know.
*2. Deprecating Native File Format Readers/Writers in the API*

The current API contains format-specific readers/writers for Parquet, Avro, and ORC. With the introduction of the InternalData and File Format APIs, Iceberg users can now write files using:

- the InternalData API for metadata files (manifest, manifest list, partition stats).
- the File Format API for data and delete files.

I propose we deprecate the original format-specific writers and guide users to the new APIs based on the target file type. If you're aware of any use cases that still require the original format-specific writers, please share them.

Thanks,
Peter

[1] - Position Delete File Spec: https://iceberg.apache.org/spec/?h=delete#position-delete-files

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Tue, Jul 22, 2025, 16:09):

I've also put together a solution where the engine-specific format transformation is separated from the writer, and the engines need to take care of it separately.
This is somewhat complicated on the implementation side (see: [RowDataTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08) and [InternalRowTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)), but it simplifies the API.

@rdblue: Please check the proposed solution. I think this is what you suggested.

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Mon, Jun 30, 2025, 18:42):

During the PR review [1], we began exploring what we could use as an intermediate layer to reduce the need for engines and file formats to implement the full matrix of file format - object model conversions.

To support this discussion, I've created and run a set of performance benchmarks and compiled a document outlining the potential benefits and trade-offs [2].

Feedback is welcome; feel free to comment on the document, the PR, or directly in this thread.

Thanks,
Peter

[1] - PR discussion - https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
[2] - File Format and engine object model transformation performance - https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, May 7, 2025, 13:15):

Hi everyone,
The proposed API part is reviewed and ready to go. See: https://github.com/apache/iceberg/pull/12774
Thanks to everyone who reviewed it already!
Many of you wanted to review, but I know that the time constraints are there for everyone. I still very much would like to hear your voices, so I will not merge the PR this week. Please review it if you can.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Apr 16, 2025, 7:02):

Hi Renjie,
The first one for the proposed new API is here: https://github.com/apache/iceberg/pull/12774
Thanks, Peter

On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote:

Hi, Peter:

Thanks for the effort. I totally agree with splitting them into smaller PRs to move forward.

I'm quite interested in this topic, so please ping me on those split PRs and I'll help review.

On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Peter

Awesome! Thank you so much!
I will do a new pass.

Regards
JB

On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi JB,

Separated out the proposed interfaces to a new PR: https://github.com/apache/iceberg/pull/12774.
Reviewers can check that out if they are only interested in what the new API would look like.

Thanks,
Peter

Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025, 18:25):

Hi Peter

Thanks for the ping about the PR.

Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
- one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, AppenderBuilder, DataWriterBuilder, ...)
- one for each file provider (Parquet, Avro, ORC)

Thoughts? I can help with the split if needed.

Regards
JB

On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > To recap, here is what the PR does after the review >>>>>>>>>>>>>>>>>> rounds: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Created 3 interface classes which are implemented by >>>>>>>>>>>>>>>>>> the file formats: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > ReadBuilder - Builder for reading data from data >>>>>>>>>>>>>>>>>> files >>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing data to data >>>>>>>>>>>>>>>>>> files >>>>>>>>>>>>>>>>>> >> > ObjectModel - Providing ReadBuilders, and >>>>>>>>>>>>>>>>>> AppenderBuilders for the specific data file format and >>>>>>>>>>>>>>>>>> object model pair >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Updated the Parquet, Avro, ORC implementation for >>>>>>>>>>>>>>>>>> this interfaces, and deprecated the old reader/writer APIs >>>>>>>>>>>>>>>>>> >> > Created interface classes which will be used by the >>>>>>>>>>>>>>>>>> actual readers/writers of the data files: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing a file >>>>>>>>>>>>>>>>>> >> > DataWriterBuilder - Builder for generating a data >>>>>>>>>>>>>>>>>> file >>>>>>>>>>>>>>>>>> >> > PositionDeleteWriterBuilder - Builder for generating >>>>>>>>>>>>>>>>>> a position delete file >>>>>>>>>>>>>>>>>> >> > EqualityDeleteWriterBuilder - Builder for generating >>>>>>>>>>>>>>>>>> an equality delete file >>>>>>>>>>>>>>>>>> >> > No ReadBuilder here - the file format reader builder >>>>>>>>>>>>>>>>>> is reused >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Created a WriterBuilder class which implements the >>>>>>>>>>>>>>>>>> interfaces above >>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >>>>>>>>>>>>>>>>>> based on a provided file format specific AppenderBuilder >>>>>>>>>>>>>>>>>> >> > Created an ObjectModelRegistry which stores the >>>>>>>>>>>>>>>>>> available ObjectModels, and engines and users could request >>>>>>>>>>>>>>>>>> the readers >>>>>>>>>>>>>>>>>> (ReadBuilder) and writers >>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >>>>>>>>>>>>>>>>>> from. >>>>>>>>>>>>>>>>>> >> > Created the appropriate ObjectModels: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > GenericObjectModels - for reading and writing >>>>>>>>>>>>>>>>>> Iceberg Records >>>>>>>>>>>>>>>>>> >> > SparkObjectModels - for reading (vectorized and >>>>>>>>>>>>>>>>>> non-vectorized) and writing Spark InternalRow/ColumnarBatch >>>>>>>>>>>>>>>>>> objects >>>>>>>>>>>>>>>>>> >> > FlinkObjectModels - for reading and writing Flink >>>>>>>>>>>>>>>>>> RowData objects >>>>>>>>>>>>>>>>>> >> > An arrow object model is also registered for >>>>>>>>>>>>>>>>>> vectorized reads of Parquet files into Arrow ColumnarBatch >>>>>>>>>>>>>>>>>> objects >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Updated the production code where the reading and >>>>>>>>>>>>>>>>>> writing happens to use the ObjectModelRegistry and the new >>>>>>>>>>>>>>>>>> reader/writer >>>>>>>>>>>>>>>>>> interfaces to access data files >>>>>>>>>>>>>>>>>> >> > Kept the testing code intact to ensure that the new >>>>>>>>>>>>>>>>>> API/code is not breaking anything >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > The original change was not small, and grew >>>>>>>>>>>>>>>>>> substantially during the review rounds. So if you have >>>>>>>>>>>>>>>>>> questions, or I can >>>>>>>>>>>>>>>>>> do anything to make the review easier, don't hesitate to >>>>>>>>>>>>>>>>>> ask. 
The original change was not small, and it grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):

Hi everyone,

I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.

I would like to merge this only after the 1.9.0 release, so we have more time to find and solve any issues before this goes into a release for the users.

For this I have updated the deprecation comments accordingly.
I would like to ask you to review the PR, so we can iron out any requested changes and be ready to merge as soon as possible after the 1.9.0 release.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):

Hi Renjie,

> 1. File format filters
>
> Do the filters include filter expressions from both the user query and the delete filter?

The current discussion is about the filters from the user query.

About the delete filter:
Based on the suggestions on the PR, I have moved the delete filter out from the main API. Created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to Parquet vectorized readers in Spark, as this is the only place where we have currently implemented this feature.

Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):

Hi, Peter:

Thanks for the effort on this.

1. File format filters

Do the filters include filter expressions from both the user query and the delete filter?

For filters from the user query, I agree with you that we should keep the current behavior.

For delete filters associated with data files, at first I thought file format readers should not care about this.
But now I realized that maybe we need to also push it to the file reader; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.

But anyway, I agree that we could postpone it to a follow-up PR.

2. Batch size configuration

I'm leaning toward option 2.

3. Spark configuration

I'm leaning towards using different configuration objects.

On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Team,
Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
I have addressed most of the comments, but a few questions still remain which might merit a wider audience:

- We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied on a best-effort basis by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers either reject part of the filters, or they apply them fully.
- Batch sizes are currently parameters for the reader builders, which could be set for non-vectorized readers too, which could be confusing.
- Currently the Spark batch reader uses different configuration objects for ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.

I would be interested in hearing your thoughts about these topics.

My current take:

File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement a secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication.
Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.

Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions:

- Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class
- We could put this into the reader configuration property set - This could work, but it would "hide" a configuration option which is valid for both Parquet and ORC readers
- We could keep things as they are now - I would choose this one, but I don't have a strong opinion here

Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.

Thanks,
Peter

Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):

Hi Peter

Thanks for the update. I will do a new pass on the PR.

Regards
JB

On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Team,
I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):

Hi everyone,
Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
Updated the code to address most of it. Please check if you agree with the general approach.
If there is a consensus about the general approach, I could separate out the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks, >>>>>>>>>>>>>>>>>> >> >>>>>> >> Peter >>>>>>>>>>>>>>>>>> >> >>>>>> >> >>>>>>>>>>>>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> ezt >>>>>>>>>>>>>>>>>> írta (időpont: 2025. febr. 20., Cs, 14:14): >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Hi Peter >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> sorry for the late reply on this. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> I did a pass on the proposal, it's very >>>>>>>>>>>>>>>>>> interesting and well written. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> I like the DataFile API and definitely >>>>>>>>>>>>>>>>>> worth to discuss all together. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to >>>>>>>>>>>>>>>>>> discuss about DataFile API ? >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Thoughts ? >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Regards >>>>>>>>>>>>>>>>>> >> >>>>>> >>> JB >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry < >>>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Hi Team, >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > As mentioned earlier on our Community >>>>>>>>>>>>>>>>>> Sync I am exploring the possibility to define a FileFormat >>>>>>>>>>>>>>>>>> API for >>>>>>>>>>>>>>>>>> accessing different file formats. I have put together a >>>>>>>>>>>>>>>>>> proposal based on >>>>>>>>>>>>>>>>>> my findings. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > ------------------- >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different >>>>>>>>>>>>>>>>>> file formats: Avro, Parquet, ORC. With the introduction of >>>>>>>>>>>>>>>>>> Iceberg V3 >>>>>>>>>>>>>>>>>> specification many new features are added to Iceberg. Some >>>>>>>>>>>>>>>>>> of these >>>>>>>>>>>>>>>>>> features like new column types, default values require >>>>>>>>>>>>>>>>>> changes at the file >>>>>>>>>>>>>>>>>> format level. The changes are added by individual developers >>>>>>>>>>>>>>>>>> with different >>>>>>>>>>>>>>>>>> focus on the different file formats. As a result not all of >>>>>>>>>>>>>>>>>> the features >>>>>>>>>>>>>>>>>> are available for every supported file format. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Also there are emerging file formats like >>>>>>>>>>>>>>>>>> Vortex [1] or Lance [2] which either by specialization, or >>>>>>>>>>>>>>>>>> by applying >>>>>>>>>>>>>>>>>> newer research results could provide better alternatives for >>>>>>>>>>>>>>>>>> certain >>>>>>>>>>>>>>>>>> use-cases like random access for data, or storing ML models. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > ------------------- >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] >>>>>>>>>>>>>>>>>> and the google document [4], and comment there or reply on >>>>>>>>>>>>>>>>>> the dev list if >>>>>>>>>>>>>>>>>> you have any suggestions. 
>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Thanks, >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Peter >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/ >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [3] - >>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/12225 >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [4] - >>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>