Peter, thanks for summarizing the four options. Both 0 and 1 seem good to me, as they are explicit and make it easier to deprecate and remove position deletes in the future. Maybe option 0 is a tiny bit better, as it is similar to the existing FileWriterFactory API.
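For context, option 0 would keep two explicit entry points on the format model. A minimal sketch, generifying the signatures from Peter's summary below over the engine row type D (illustration only, not the final API):

    interface FormatModel<D> {
      // Writer builder for data files and equality delete files of type D
      WriteBuilder<D> writeBuilder(OutputFile outputFile);

      // Writer builder for V2 position delete files; this one can be
      // deprecated and removed together with position delete write support
      WriteBuilder<PositionDelete<D>> positionDeleteWriteBuilder(OutputFile outputFile);
    }

This keeps the position delete path behind a single method that can be deprecated on its own.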
I will leave PR-related comments in the PR directly.

On Mon, Sep 15, 2025 at 8:38 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Thanks for the feedback @Russell and @Renjie!

Updated the PR accordingly.
Also removed the possibility to set the row schema for the position delete writer. We will not need that after the PDWR deprecation.

You can see one possible implementation in https://github.com/apache/iceberg/pull/12298 - we can discuss that separately. I just made sure that the new API is able to serve all of the current needs.

@Ryan: What are your thoughts?

Are we at a stage where we can vote on the current API?

Thanks,
Peter

Renjie Liu <liurenjie2...@gmail.com> wrote (on Mon, Sep 15, 2025, 12:08):

I would also vote for option 0. This API has a clean separation and makes refactoring easier; e.g., when we completely deprecate V2 tables, we could mark the *positionDeleteWriteBuilder* method as deprecated, and it would be easier to remove its usage.

On Fri, Sep 12, 2025 at 11:24 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Now that I fully understand the situation, I think option 0 as you've written it is probably the best thing to do, as long as PositionDelete is a class. I think with hindsight it probably shouldn't have been a class and always been an interface, so that our internal code could produce rows which implement PositionDelete rather than PositionDeletes that wrap rows.

On Fri, Sep 12, 2025 at 8:02 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Let me summarize the state a bit:

The FileFormat interface needs to expose two distinct methods:

- WriteBuilder<InternalRow>
- WriteBuilder<PositionDelete<InternalRow>>
  - After the PDWR deprecation this will be WriteBuilder<PositionDelete>
  - After the V2 deprecation this will not be needed anymore

Based on the file format methods, the Registry must support four builder types:

- WriteBuilder<InternalRow>
- DataWriteBuilder<InternalRow>
- EqualityDeleteWriteBuilder<InternalRow>
- PositionDeleteWriteBuilder<InternalRow>

*API Design Considerations*
There is an argument that the two WriteBuilder methods provided by FileFormat are essentially the same, differing only in the writerFunction. While this is technically correct for current implementations, I believe the API should clearly distinguish between the two writer types to highlight the differences.

*Discussed Approaches*

*0. Two Explicit Methods on FormatModel* (removed based on previous comments, but I personally still prefer this)

    WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
    WriteBuilder<PositionDelete<InternalRow>> positionDeleteWriteBuilder(OutputFile outputFile);

Pros: Clear separation of responsibilities

*1. One Builder + One Converter*

    WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
    Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);

Pros: Keeps the interface compact
Cons: Requires additional documentation and an understanding of why the conversion logic is needed
*2. Single Method with Javadoc Clarification* (most similar to the current approach)

    WriteBuilder writeBuilder(OutputFile outputFile);

Pros: Minimalistic
Cons: Least explicit; relies entirely on documentation

*2/b. Single Builder with Type Parameter* (based on Russell's suggestion)

    WriteBuilder writeBuilder(OutputFile outputFile);
    // Usage: builder.build(Class<D> inputType)

Pros: Flexible
Cons: Relies on documentation to clarify the available input types

*Bonus*
Options 0 and 1 make it easier to phase out PositionDelete filtering once V2 tables are deprecated.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Thu, Sep 11, 2025, 18:36):

> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.

With the *second approach*, the WriteBuilder doesn't need to handle PositionDelete objects directly. The conversion layer takes care of that, so the WriteBuilder only needs to work with InternalRow.

With the *first approach*, we shift that responsibility to the WriteBuilder, which then has to support both InternalRow and PositionDelete<InternalRow>.

In both cases, the FormatModelRegistry API will still expose the more concrete types (PositionDelete / InternalRow). However, under the *second approach*, the lower-level API only needs to handle InternalRow, simplifying its interface.

Thanks,
Peter

Russell Spitzer <russell.spit...@gmail.com> wrote (on Thu, Sep 11, 2025, 17:12):

Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.

On Thu, Sep 11, 2025 at 5:35 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Thanks, Russell, for taking a look at this!

We need to expose four methods on the user-facing API (FormatModelRegistry):

1. *writeBuilder* - for writing arbitrary files without Iceberg metadata. In the Iceberg codebase, this is exposed via FlinkAppenderFactory and the GenericAppenderFactory for creating FileAppender<RowData> and FileAppender<Record> only.
2. *dataWriteBuilder* - for creating and collecting metadata for Iceberg DataFiles.
3. *equalityDeleteWriteBuilder* - for creating and collecting metadata for Iceberg EqualityDeleteFiles.
4. *positionDeleteWriteBuilder* - for creating and collecting metadata for Iceberg PositionDeleteFiles.

We'd like to implement all four using a single WriteBuilder created by the FormatModels.
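As a sketch, the user-facing side could then look something like this (signatures assumed for illustration, patterned after the FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile) usage further down this thread; the final API may differ):

    // Hypothetical shape of the four FormatModelRegistry entry points;
    // D is the engine row type (e.g. InternalRow for Spark)
    <D> WriteBuilder<D> writeBuilder(FileFormat format, Class<D> type, OutputFile file);
    <D> DataWriteBuilder<D> dataWriteBuilder(FileFormat format, Class<D> type, OutputFile file);
    <D> EqualityDeleteWriteBuilder<D> equalityDeleteWriteBuilder(FileFormat format, Class<D> type, OutputFile file);
    <D> PositionDeleteWriteBuilder<D> positionDeleteWriteBuilder(FileFormat format, Class<D> type, OutputFile file);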
Your suggestion is a good one - it helps formalize the requirements for the build method and also surfaces an important design question:

*Who should be responsible for handling the differences between normal rows (InternalRow) and position deletes (PositionDelete<InternalRow>)?*

- Should we have a more complex WriteBuilder class that can create both DataFileAppender and PositionDeleteAppender?
- Or should we push this responsibility to the engine-specific code, where we already have some logic (e.g., pathTransformFunc) needed by each engine to create the PositionDeleteAppender?

Thanks,
Peter

Russell Spitzer <russell.spit...@gmail.com> wrote (on Thu, Sep 11, 2025, 0:11):

I'm a little confused here, I think Ryan mentioned this in the comment here https://github.com/apache/iceberg/pull/12774/files#r2254967177

From my understanding there are two options?

1) We are either producing FormatModels that take a generic row type D and produce writers that all take D and write files.

2) We are creating IcebergModel-specific writers that take DataFile, PositionDeleteFile, EqualityDeleteFile etc ... and write files

The PositionDelete Converter issue seems to stem from attempting to do both model 1 (being very generic) and 2, wanting special code to deal with PositionDeleteFile<R> objects.

It looks like the code in #12774 is mostly doing model 1, but we are trying to add in a specific converter for 2?

Maybe I'm totally lost here, but I was assuming we would do something a little scala-y like

    public <T> FileAppender<T> build(Class<T> type) {
      if (type == DataFile.class) return (FileAppender<T>) new DataFileAppender();
      if (type == DeleteFile.class) return (FileAppender<T>) new DeleteFileAppender();
      // ...
    }

So that we only register a single signature, and if a writer-specific implementation needs to do something special, it can? I'm trying to catch back up to speed on this PR, so it may help to do a quick summary of the current state and intent. (At least for me)

On Tue, Sep 9, 2025 at 3:42 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Renjie,
Thanks for taking a look!

Let me clarify a few points:
- The converter API is only required for writing position delete files for V2 tables
- Currently, there are no plans to support vectorized writing via the Java API
- Even if we decide to support vectorized writes, I don't think we would like to implement it for positional deletes, which are deprecated in the new spec.
- Also, once the positional deletes - which contain the deleted rows - are deprecated (as planned), the conversion of the position deletes with only file name and position would be trivial, even for the vectorized writes.

So from my perspective, the converter method exists purely for backward compatibility, and we intend to remove it as soon as possible. Sacrificing good practices for the sake of a deprecated feature doesn't seem worthwhile to me.

Thanks,
Peter

Renjie Liu <liurenjie2...@gmail.com> wrote (on Mon, Sep 8, 2025, 12:34):

Hi, Peter:

I would vote for the first approach. In spite of the compromises described, the API is still cleaner.
Also, I think there are some problems with the converter API. For example, for vectorized implementations such as Comet, which accept columnar batches rather than rows, the converter method would make things more complicated.

On Sat, Aug 30, 2025 at 2:49 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

I've initiated a discussion thread regarding the deprecation of Position Deletes containing row data. You can follow it here: https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s

We can proceed with the discussion about the native reader/writer deprecation once we have decided on the final API, as the chosen design may influence our approach.

Since then, one more question has come up - hopefully the last: *How should we handle Position Delete Writers?*
The File Format API should return builders for either rows or PositionDelete objects. Currently the method `WriteBuilder.createWriterFunc(Function<MessageType, ParquetValueWriter<?>>)` defines the accepted input parameters for the writer. Users are responsible for ensuring that the writer function and the return type of `WriteBuilder.build()` are compatible. In the new API, we no longer expose writer functions. We still expose FileContent, since writer configurations vary by content type, but we don't expose the types.

There are two proposals for handling types for the WriteBuilders:

1. *Implicit Type Definition via FileContent* - the builder parameter for FileContent would implicitly define the input type for the writer returned by build(), or
2. *Engine level conversion* - Engines would convert PositionDelete objects to their native types.

In code:

- In the 1st proposal, FormatModel.writeBuilder(OutputFile outputFile) can return anything:

    WriteBuilder builder = FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
    FileAppender<InternalRow> appender = builder
        .schema(table.schema())
        .content(FileContent.DATA)
        ....
        .build();

    // Exposed, but FormatModelRegistry.positionDeleteWriteBuilder should be used instead
    WriteBuilder builder = FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
    FileAppender<PositionDelete<InternalRow>> appender = builder
        .schema(table.schema())
        .content(FileContent.POSITION_DELETES)
        ....
        .build();

- In the 2nd proposal, the FormatModel needs another method:

    Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);

example implementation:

    return delete -> {
      deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
      deleteRecord.update(1, delete.pos());
      deleteRecord.update(2, delete.row());
      return deleteRecord;
    };

    // Content is only used for writer property configuration
    WriteBuilder<InternalRow> builder = sparkFormatModel.writeBuilder(outputFile);
    FileAppender<InternalRow> appender = builder
        .schema(table.schema())
        .content(FileContent.DATA)
        ....
        .build();
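In the converter example, deleteRecord is presumably a reusable buffer row. A more self-contained sketch of such a converter for Spark (illustration only, assuming a GenericInternalRow buffer with the three position delete fields; not the final code):

    import java.util.function.Function;
    import org.apache.iceberg.deletes.PositionDelete;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
    import org.apache.spark.unsafe.types.UTF8String;

    // Converts PositionDelete<InternalRow> into an InternalRow laid out as
    // (file_path, pos, row), matching the position delete schema.
    // The buffer row is reused across calls, like in the example above.
    static Function<PositionDelete<InternalRow>, InternalRow> positionDeleteConverter() {
      GenericInternalRow deleteRecord = new GenericInternalRow(3);
      return delete -> {
        deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
        deleteRecord.update(1, delete.pos());
        deleteRecord.update(2, delete.row());
        return deleteRecord;
      };
    }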
Drawbacks:

- Proposal 1:
  - Type checking for the FileAppenders occurs only at runtime, so user errors surface late.
  - The File Format specification must clearly specify which builder type corresponds to which file content parameter; generics would offer better clarity.
  - Inconsistent patterns between WriteBuilder and ReadBuilder, as the latter can define output types via generics.
- Proposal 2:
  - Requires FormatModels to implement a converter method to transform PositionDelete<InternalRow> into InternalRow.

Since we deprecated writing position delete files in the V3 spec, this extra method in the 2nd proposal will be deprecated too. As a result, in the long run, we will have a nice, clean API.
OTOH, if we accept the compromise described in the 1st proposal, the consequences of our decision will remain, even when the deprecated functions are removed.

Looking forward to your thoughts.
Thanks, Peter

On Thu, Aug 14, 2025, 14:12 Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Team,

During yesterday's community sync, we discussed the current state of the File Format API proposal and identified two key questions that require input from the broader community:

*1. Dropping support for Position Delete files with Row Data*

The current Iceberg V2 spec [1] defines two types of position delete files:

- Files that store only the file name and row position.
- Files that also store the deleted row data.

Although this feature is defined in the spec and some tests exist in the Iceberg codebase, we're not aware of any actual implementation using the second type (with row data). Supporting V2 table writing via the new File Format API would be simpler if we dropped support for this feature.
If you know of any use case or reason to retain support for position deletes with row data, please let us know.
*2. Deprecating Native File Format Readers/Writers in the API*

The current API contains format-specific readers/writers for Parquet, Avro, and ORC. With the introduction of the InternalData and File Format APIs, Iceberg users can now write files using:

- the InternalData API for metadata files (manifest, manifest list, partition stats).
- the File Format API for data and delete files.

I propose we deprecate the original format-specific writers and guide users to the new APIs based on the target file type. If you're aware of any use cases that still require the original format-specific writers, please share them.

Thanks,
Peter

[1] - Position Delete File Spec: https://iceberg.apache.org/spec/?h=delete#position-delete-files

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Tue, Jul 22, 2025, 16:09):

I've also put together a solution where the engine-specific format transformation is separated from the writer, and the engines need to take care of it separately.
This is somewhat complicated on the implementation side (see: [RowDataTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08) and [InternalRowTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)), but it simplifies the API.

@rdblue: Please check the proposed solution. I think this is what you suggested.

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Mon, Jun 30, 2025, 18:42):

During the PR review [1], we began exploring what we could use as an intermediate layer to reduce the need for engines and file formats to implement the full matrix of file format - object model conversions.

To support this discussion, I've created and run a set of performance benchmarks and compiled a document outlining the potential benefits and trade-offs [2].

Feedback is welcome; feel free to comment on the document, the PR, or directly in this thread.

Thanks,
Peter

[1] - PR discussion - https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
[2] - File Format and engine object model transformation performance - https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, May 7, 2025, 13:15):

Hi everyone,
The proposed API part is reviewed and ready to go. See: https://github.com/apache/iceberg/pull/12774
Thanks to everyone who reviewed it already!
Many of you wanted to review, but I know that the time constraints are there for everyone. I still very much would like to hear your voices, so I will not merge the PR this week. Please review it if you can.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Apr 16, 2025, 7:02):

Hi Renjie,
The first one for the proposed new API is here: https://github.com/apache/iceberg/pull/12774
Thanks, Peter

On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote:

Hi, Peter:

Thanks for the effort. I totally agree with splitting them into smaller PRs to move forward.

I'm quite interested in this topic, so please ping me on those split PRs and I'll help review.

On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Peter

Awesome! Thank you so much!
I will do a new pass.

Regards
JB

On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi JB,

Separated out the proposed interfaces to a new PR: https://github.com/apache/iceberg/pull/12774.
Reviewers can check that out if they are only interested in what the new API would look like.

Thanks,
Peter

Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025, 18:25):

Hi Peter

Thanks for the ping about the PR.

Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
- one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, AppenderBuilder, DataWriterBuilder, ...)
- one for each file provider (Parquet, Avro, ORC)

Thoughts? I can help with the split if needed.

Regards
JB

On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > To recap, here is what the PR does after the review >>>>>>>>>>>>>>>>>> rounds: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Created 3 interface classes which are implemented by >>>>>>>>>>>>>>>>>> the file formats: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > ReadBuilder - Builder for reading data from data >>>>>>>>>>>>>>>>>> files >>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing data to data >>>>>>>>>>>>>>>>>> files >>>>>>>>>>>>>>>>>> >> > ObjectModel - Providing ReadBuilders, and >>>>>>>>>>>>>>>>>> AppenderBuilders for the specific data file format and >>>>>>>>>>>>>>>>>> object model pair >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Updated the Parquet, Avro, ORC implementation for >>>>>>>>>>>>>>>>>> this interfaces, and deprecated the old reader/writer APIs >>>>>>>>>>>>>>>>>> >> > Created interface classes which will be used by the >>>>>>>>>>>>>>>>>> actual readers/writers of the data files: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing a file >>>>>>>>>>>>>>>>>> >> > DataWriterBuilder - Builder for generating a data >>>>>>>>>>>>>>>>>> file >>>>>>>>>>>>>>>>>> >> > PositionDeleteWriterBuilder - Builder for generating >>>>>>>>>>>>>>>>>> a position delete file >>>>>>>>>>>>>>>>>> >> > EqualityDeleteWriterBuilder - Builder for generating >>>>>>>>>>>>>>>>>> an equality delete file >>>>>>>>>>>>>>>>>> >> > No ReadBuilder here - the file format reader builder >>>>>>>>>>>>>>>>>> is reused >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Created a WriterBuilder class which implements the >>>>>>>>>>>>>>>>>> interfaces above >>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >>>>>>>>>>>>>>>>>> based on a provided file format specific AppenderBuilder >>>>>>>>>>>>>>>>>> >> > Created an ObjectModelRegistry which stores the >>>>>>>>>>>>>>>>>> available ObjectModels, and engines and users could request >>>>>>>>>>>>>>>>>> the readers >>>>>>>>>>>>>>>>>> (ReadBuilder) and writers >>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >>>>>>>>>>>>>>>>>> from. >>>>>>>>>>>>>>>>>> >> > Created the appropriate ObjectModels: >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > GenericObjectModels - for reading and writing >>>>>>>>>>>>>>>>>> Iceberg Records >>>>>>>>>>>>>>>>>> >> > SparkObjectModels - for reading (vectorized and >>>>>>>>>>>>>>>>>> non-vectorized) and writing Spark InternalRow/ColumnarBatch >>>>>>>>>>>>>>>>>> objects >>>>>>>>>>>>>>>>>> >> > FlinkObjectModels - for reading and writing Flink >>>>>>>>>>>>>>>>>> RowData objects >>>>>>>>>>>>>>>>>> >> > An arrow object model is also registered for >>>>>>>>>>>>>>>>>> vectorized reads of Parquet files into Arrow ColumnarBatch >>>>>>>>>>>>>>>>>> objects >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > Updated the production code where the reading and >>>>>>>>>>>>>>>>>> writing happens to use the ObjectModelRegistry and the new >>>>>>>>>>>>>>>>>> reader/writer >>>>>>>>>>>>>>>>>> interfaces to access data files >>>>>>>>>>>>>>>>>> >> > Kept the testing code intact to ensure that the new >>>>>>>>>>>>>>>>>> API/code is not breaking anything >>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>> >> > The original change was not small, and grew >>>>>>>>>>>>>>>>>> substantially during the review rounds. So if you have >>>>>>>>>>>>>>>>>> questions, or I can >>>>>>>>>>>>>>>>>> do anything to make the review easier, don't hesitate to >>>>>>>>>>>>>>>>>> ask. 
The original change was not small, and it grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26, 2025, 14:54):

Hi everyone,

I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.

I would like to merge this only after the 1.9.0 release, so we have more time to find and solve any issues before this goes into a release for the users.

For this I have updated the deprecation comments accordingly.
I would like to ask you to review the PR, so we can iron out any requested changes and be ready to merge as soon as possible after the 1.9.0 release.

Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:32):

Hi Renjie,

> 1. File format filters
>
> Do the filters include filter expressions from both the user query and the delete filter?

The current discussion is about the filters from the user query.

About the delete filter:
Based on the suggestions on the PR, I have moved the delete filter out from the main API. Created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to Parquet vectorized readers in Spark, as this is the only place where we have currently implemented this feature.

Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025, 14:11):

Hi, Peter:

Thanks for the effort on this.

1. File format filters

Do the filters include filter expressions from both the user query and the delete filter?

For filters from the user query, I agree with you that we should keep the current behavior.

For delete filters associated with data files, at first I thought file format readers should not care about this.
But now I realized that maybe we need to also push it to the file reader; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.

But anyway, I agree that we could postpone it to a follow-up PR.

2. Batch size configuration

I'm leaning toward option 2.

3. Spark configuration

I'm leaning towards using different configuration objects.

On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Team,
Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
I have addressed most of the comments, but a few questions still remain which might merit a wider audience:

- We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied on a best-effort basis by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers either reject part of the filters, or they apply them fully.
- Batch sizes are currently parameters for the reader builders, which could be set for non-vectorized readers too, which could be confusing.
- Currently the Spark batch reader uses different configuration objects for ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.

I would be interested in hearing your thoughts about these topics.

My current take:

File format filters: I am leaning towards keeping the current lenient behavior, especially since Bloom filters are not able to do full filtering and are often used as a way to filter out unwanted records. Another option would be to implement a secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication.
Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.

Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers. I see 3 possible solutions:

- Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class
- We could put this into the reader configuration property set - This could work, but it would "hide" a configuration option which is valid for both Parquet and ORC readers
- We could keep things as they are now - I would choose this one, but I don't have a strong opinion here

Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.

Thanks,
Peter

Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14, 2025, 16:31):

Hi Peter

Thanks for the update. I will do a new pass on the PR.

Regards
JB

On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Team,
I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
Thanks,
Peter

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Feb 28, 2025, 17:50):

Hi everyone,
Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
Updated the code to address most of it. Please check if you agree with the general approach.
If there is a consensus about the general approach, I could separate out the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks, >>>>>>>>>>>>>>>>>> >> >>>>>> >> Peter >>>>>>>>>>>>>>>>>> >> >>>>>> >> >>>>>>>>>>>>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> ezt >>>>>>>>>>>>>>>>>> írta (időpont: 2025. febr. 20., Cs, 14:14): >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Hi Peter >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> sorry for the late reply on this. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> I did a pass on the proposal, it's very >>>>>>>>>>>>>>>>>> interesting and well written. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> I like the DataFile API and definitely >>>>>>>>>>>>>>>>>> worth to discuss all together. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to >>>>>>>>>>>>>>>>>> discuss about DataFile API ? >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Thoughts ? >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> Regards >>>>>>>>>>>>>>>>>> >> >>>>>> >>> JB >>>>>>>>>>>>>>>>>> >> >>>>>> >>> >>>>>>>>>>>>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry < >>>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Hi Team, >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > As mentioned earlier on our Community >>>>>>>>>>>>>>>>>> Sync I am exploring the possibility to define a FileFormat >>>>>>>>>>>>>>>>>> API for >>>>>>>>>>>>>>>>>> accessing different file formats. I have put together a >>>>>>>>>>>>>>>>>> proposal based on >>>>>>>>>>>>>>>>>> my findings. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > ------------------- >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different >>>>>>>>>>>>>>>>>> file formats: Avro, Parquet, ORC. With the introduction of >>>>>>>>>>>>>>>>>> Iceberg V3 >>>>>>>>>>>>>>>>>> specification many new features are added to Iceberg. Some >>>>>>>>>>>>>>>>>> of these >>>>>>>>>>>>>>>>>> features like new column types, default values require >>>>>>>>>>>>>>>>>> changes at the file >>>>>>>>>>>>>>>>>> format level. The changes are added by individual developers >>>>>>>>>>>>>>>>>> with different >>>>>>>>>>>>>>>>>> focus on the different file formats. As a result not all of >>>>>>>>>>>>>>>>>> the features >>>>>>>>>>>>>>>>>> are available for every supported file format. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Also there are emerging file formats like >>>>>>>>>>>>>>>>>> Vortex [1] or Lance [2] which either by specialization, or >>>>>>>>>>>>>>>>>> by applying >>>>>>>>>>>>>>>>>> newer research results could provide better alternatives for >>>>>>>>>>>>>>>>>> certain >>>>>>>>>>>>>>>>>> use-cases like random access for data, or storing ML models. >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > ------------------- >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] >>>>>>>>>>>>>>>>>> and the google document [4], and comment there or reply on >>>>>>>>>>>>>>>>>> the dev list if >>>>>>>>>>>>>>>>>> you have any suggestions. 
>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Thanks, >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Peter >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/ >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [3] - >>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/12225 >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [4] - >>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds >>>>>>>>>>>>>>>>>> >> >>>>>> >>> > >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>