I would also vote for option 0. This API has a clean separation and makes
refactoring easier; e.g., when we completely deprecate V2 tables, we could
mark the *positionDeleteWriteBuilder* method as deprecated, and it would be
easier to remove its usage.

On Fri, Sep 12, 2025 at 11:24 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Now that I fully understand the situation, I think option 0 as you've
> written is probably the best thing to do, as long as PositionDelete is a
> class. With hindsight, it probably shouldn't have been a class but an
> interface, so that our internal code could produce rows which implement
> PositionDelete rather than PositionDeletes that wrap rows.
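>
> To illustrate that alternative (a hypothetical sketch, since PositionDelete
> is a concrete class today; all names below are illustrative only):
>
>     interface PositionDelete<R> {
>       CharSequence path();
>       long pos();
>       R row();
>     }
>
>     // An engine row could then *be* a position delete, with no wrapper:
>     // class SparkPositionDeleteRow extends InternalRow
>     //     implements PositionDelete<InternalRow> { ... }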
>
> On Fri, Sep 12, 2025 at 8:02 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Let me summarize the state a bit:
>>
>> The FileFormat interface needs to expose two distinct methods:
>>
>>    - WriteBuilder<InternalRow>
>>    - WriteBuilder<PositionDelete<InternalRow>>
>>       - After the PDWR deprecation this will be
>>       WriteBuilder<PositionDelete>
>>       - After the V2 deprecation, this will no longer be needed
>>
>> Based on the file format methods, the Registry must support four builder
>> types:
>>
>>    - WriteBuilder<InternalRow>
>>    - DataWriteBuilder<InternalRow>
>>    - EqualityDeleteWriteBuilder<InternalRow>
>>    - PositionDeleteWriteBuilder<InternalRow>
>>
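>> A hedged sketch of the Registry surface this implies; the four entry points
>> follow the list above and the usage shown later in this thread, while the
>> placeholder builder interfaces and exact parameters are assumptions, not the
>> final API:
>>
>>     import org.apache.iceberg.FileFormat;
>>     import org.apache.iceberg.io.OutputFile;
>>
>>     // Placeholder builder types, only to make the shapes concrete:
>>     interface WriteBuilder<D> {}
>>     interface DataWriteBuilder<D> {}
>>     interface EqualityDeleteWriteBuilder<D> {}
>>     interface PositionDeleteWriteBuilder<D> {}
>>
>>     interface FormatModelRegistrySketch {
>>       <D> WriteBuilder<D> writeBuilder(
>>           FileFormat format, Class<D> inputType, OutputFile outputFile);
>>       <D> DataWriteBuilder<D> dataWriteBuilder(
>>           FileFormat format, Class<D> inputType, OutputFile outputFile);
>>       <D> EqualityDeleteWriteBuilder<D> equalityDeleteWriteBuilder(
>>           FileFormat format, Class<D> inputType, OutputFile outputFile);
>>       <D> PositionDeleteWriteBuilder<D> positionDeleteWriteBuilder(
>>           FileFormat format, Class<D> inputType, OutputFile outputFile);
>>     }
>>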
>>
>> *API Design Considerations*
>> There is an argument that the two WriteBuilder methods provided by
>> FileFormat are essentially the same, differing only in the writerFunction.
>> While this is technically correct for current implementations, I believe
>> the API should clearly distinguish between the two writer types to
>> highlight the differences.
>>
>> *Discussed Approaches*
>>
>> *0. Two Explicit Methods on FormatModel* (removed based on previous
>> comments, but I personally still prefer this)
>>
>> WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
>> WriteBuilder<PositionDelete<InternalRow>> positionDeleteWriteBuilder(OutputFile outputFile);
>>
>>
>> Pros: Clear separation of responsibilities
>>
>> *1. One Builder + One Converter*
>>
>> WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
>> Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
>>
>>
>> Pros: Keeps the interface compact
>> Cons: Requires additional documentation and an understanding of why the
>> conversion logic is needed
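>>
>> For illustration, a possible converter for the generic Record model (a
>> hedged sketch; the GenericRecord reuse pattern and the positional field
>> order file_path/pos/row are my assumptions here):
>>
>>     import java.util.function.Function;
>>     import org.apache.iceberg.Schema;
>>     import org.apache.iceberg.data.GenericRecord;
>>     import org.apache.iceberg.data.Record;
>>     import org.apache.iceberg.deletes.PositionDelete;
>>
>>     Function<PositionDelete<Record>, Record> positionDeleteConverter(Schema deleteSchema) {
>>       GenericRecord reuse = GenericRecord.create(deleteSchema);
>>       return delete -> {
>>         reuse.set(0, delete.path().toString());  // file_path
>>         reuse.set(1, delete.pos());              // pos
>>         reuse.set(2, delete.row());              // row (may be null)
>>         return reuse;
>>       };
>>     }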
>>
>> *2. Single Method with Javadoc Clarification* (most similar to the
>> current approach)
>>
>> WriteBuilder writeBuilder(OutputFile outputFile);
>>
>>
>> Pros: Minimalistic
>> Cons: Least explicit; relies entirely on documentation
>>
>> *2/b. Single Builder with Type Parameter* (based on Russell's suggestion)
>>
>> WriteBuilder writeBuilder(OutputFile outputFile);
>> // Usage: builder.build(Class<D> inputType)
>>
>>
>> Pros: Flexible
>> Cons: Relies on documentation to clarify the available input types
>>
>> *Bonus*
>> Options 0 and 1 make it easier to phase out PositionDelete filtering once
>> V2 tables are deprecated.
>>
>> Thanks,
>> Peter
>>
>> Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont: 2025. szept.
>> 11., Cs, 18:36):
>>
>>> > Wouldn't PositionDelete<InternalRow> also be an InternalRow in this
>>> example? I think that's what I'm confused about.
>>>
>>> With the *second approach*, the WriteBuilder doesn’t need to handle
>>> PositionDelete objects directly. The conversion layer takes care of
>>> that, so the WriteBuilder only needs to work with InternalRow.
>>>
>>> With the *first approach*, we shift that responsibility to the
>>> WriteBuilder, which then has to support both InternalRow and
>>> PositionDelete<InternalRow>.
>>>
>>> In both cases, the FormatModelRegistry API will still expose the more
>>> concrete types (PositionDelete / InternalRow). However, under the *second
>>> approach*, the lower-level API only needs to handle InternalRow,
>>> simplifying its interface.
>>> Thanks,
>>> Peter
>>>
>>> Russell Spitzer <russell.spit...@gmail.com> ezt írta (időpont: 2025.
>>> szept. 11., Cs, 17:12):
>>>
>>>> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this
>>>> example? I think that's what I'm confused about.
>>>>
>>>> On Thu, Sep 11, 2025 at 5:35 AM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks, Russell, for taking a look at this!
>>>>>
>>>>> We need to expose four methods on the user-facing API (
>>>>> FormatModelRegistry):
>>>>>
>>>>>    1. *writeBuilder* – for writing arbitrary files without Iceberg
>>>>>    metadata. In the Iceberg codebase, this is exposed via 
>>>>> FlinkAppenderFactory and
>>>>>    the GenericAppenderFactory for creating FileAppender<RowData> and
>>>>>    FileAppender<Record> only.
>>>>>    2. *dataWriteBuilder* – for creating and collecting metadata for
>>>>>    Iceberg DataFiles.
>>>>>    3. *equalityDeleteWriteBuilder* – for creating and collecting
>>>>>    metadata for Iceberg EqualityDeleteFiles.
>>>>>    4. *positionDeleteWriteBuilder* – for creating and collecting
>>>>>    metadata for Iceberg PositionDeleteFiles.
>>>>>
>>>>> We’d like to implement all four using a single WriteBuilder created
>>>>> by the FormatModels.
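>>>>>
>>>>> For example (a hedged sketch of the delegation idea; formatModel(...) and
>>>>> DataWriteBuilderImpl are hypothetical names, not the actual code):
>>>>>
>>>>>     static <D> DataWriteBuilder<D> dataWriteBuilder(
>>>>>         FileFormat format, Class<D> inputType, OutputFile outputFile) {
>>>>>       // the single WriteBuilder created by the format model...
>>>>>       WriteBuilder<D> writeBuilder =
>>>>>           formatModel(format, inputType).writeBuilder(outputFile);
>>>>>       // ...wrapped so that DataFile metadata is collected on close
>>>>>       return new DataWriteBuilderImpl<>(writeBuilder);
>>>>>     }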
>>>>>
>>>>> Your suggestion is a good one—it helps formalize the requirements for
>>>>> the build method and also surfaces an important design question:
>>>>>
>>>>> *Who should be responsible for handling the differences between normal
>>>>> rows (InternalRow) and position deletes (PositionDelete<InternalRow>)*
>>>>> ?
>>>>>
>>>>>    - Should we have a more complex WriteBuilder class that can create
>>>>>    both DataFileAppender and PositionDeleteAppender?
>>>>>    - Or should we push this responsibility to the engine-specific
>>>>>    code, where we already have some logic (e.g., pathTransformFunc)
>>>>>    needed by each engine to create the PositionDeleteAppender?
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>>
>>>>> Russell Spitzer <russell.spit...@gmail.com> ezt írta (időpont: 2025.
>>>>> szept. 11., Cs, 0:11):
>>>>>
>>>>>> I'm a little confused here; I think Ryan mentioned this in the
>>>>>> comment here
>>>>>> https://github.com/apache/iceberg/pull/12774/files#r2254967177
>>>>>>
>>>>>> From my understanding there are two options?
>>>>>>
>>>>>> 1) We either are producing FormatModels that take a generic row type
>>>>>> D and produce writers that all take D and write files.
>>>>>>
>>>>>> 2) We are creating IcebergModel-specific writers that take DataFile,
>>>>>> PositionDeleteFile, EqualityDeleteFile etc ... and write files
>>>>>>
>>>>>> The PositionDelete Converter issue seems to stem from attempting to
>>>>>> do both model 1 (being very generic) and 2, wanting special code to deal
>>>>>> with PositionDeleteFile<R> objects.
>>>>>>
>>>>>> It looks like the code in #12774 is mostly doing model 1, but we are
>>>>>> trying to add in a specific converter for 2?
>>>>>>
>>>>>> Maybe I'm totally lost here, but I was assuming we would do
>>>>>> something a little Scala-y, like:
>>>>>>
>>>>>> public <T> FileAppender<T> build(Class<T> type) {
>>>>>>   if (type == DataFile.class) {
>>>>>>     return (FileAppender<T>) new DataFileAppender();
>>>>>>   }
>>>>>>   if (type == DeleteFile.class) {
>>>>>>     return (FileAppender<T>) new DeleteFileAppender();
>>>>>>   }
>>>>>>   // ...
>>>>>> }
>>>>>>
>>>>>>
>>>>>> So that we only register a single signature, and if a writer-specific
>>>>>> implementation needs to do something special, it can? I'm trying to catch
>>>>>> back up to speed on this PR, so it may help to do a quick summary of the
>>>>>> current state and intent. (At least for me.)
>>>>>>
>>>>>> On Tue, Sep 9, 2025 at 3:42 AM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Renjie,
>>>>>>> Thanks for taking a look!
>>>>>>>
>>>>>>> Let me clarify a few points:
>>>>>>> - The converter API is only required for writing position delete
>>>>>>> files for V2 tables.
>>>>>>> - Currently, there are no plans to support vectorized writing via
>>>>>>> the Java API.
>>>>>>> - Even if we decide to support vectorized writes, I don't think we
>>>>>>> would want to implement them for positional deletes, which are
>>>>>>> deprecated in the new spec.
>>>>>>> - Also, once the positional deletes which contain the deleted rows
>>>>>>> are deprecated (as planned), converting Position Deletes with only a
>>>>>>> file name and position would be trivial, even for vectorized writes.
>>>>>>>
>>>>>>> So from my perspective, the converter method exists purely for
>>>>>>> backward compatibility, and we intend to remove it as soon as possible.
>>>>>>> Sacrificing good practices for the sake of a deprecated feature doesn't
>>>>>>> seem worthwhile to me.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> Renjie Liu <liurenjie2...@gmail.com> ezt írta (időpont: 2025.
>>>>>>> szept. 8., H, 12:34):
>>>>>>>
>>>>>>>> Hi, Peter:
>>>>>>>>
>>>>>>>> I would vote for the first approach. In spite of the compromises
>>>>>>>> described, the API is still cleaner. I also think there are some problems
>>>>>>>> with the converter API. For example, for vectorized implementations such
>>>>>>>> as Comet, which accept columnar batches rather than rows, the converter
>>>>>>>> method would make things more complicated.
>>>>>>>>
>>>>>>>> On Sat, Aug 30, 2025 at 2:49 PM Péter Váry <
>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I’ve initiated a discussion thread regarding the deprecation of
>>>>>>>>> Position Deletes containing row data. You can follow it here:
>>>>>>>>> https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s
>>>>>>>>>
>>>>>>>>> We can proceed with the discussion about the native reader/writer
>>>>>>>>> deprecation once we have decided on the final API, as the chosen design
>>>>>>>>> may influence our approach.
>>>>>>>>>
>>>>>>>>> Since then, one more question has come up - hopefully the last:
>>>>>>>>> *How should we handle Position Delete Writers?*
>>>>>>>>> The File Format API should return builders for either rows or
>>>>>>>>> PositionDelete objects. Currently the method
>>>>>>>>> `WriteBuilder.createWriterFunc(Function<MessageType,
>>>>>>>>> ParquetValueWriter<?>>)` defines the accepted input parameters for the
>>>>>>>>> writer. Users are responsible for ensuring that the writer function 
>>>>>>>>> and the
>>>>>>>>> return type of the `WriteBuilder.build()` are compatible. In the new 
>>>>>>>>> API,
>>>>>>>>> we no longer expose writer functions. We still expose FileContent, 
>>>>>>>>> since
>>>>>>>>> writer configurations vary by content type, but we don’t expose the 
>>>>>>>>> types.
>>>>>>>>>
>>>>>>>>> There are two proposals for handling types for the WriteBuilders:
>>>>>>>>>
>>>>>>>>>    1. *Implicit Type Definition via FileContent* - the builder
>>>>>>>>>    parameter for FileContent would implicitly define the input type 
>>>>>>>>> for the
>>>>>>>>>    writer returned by build(), or
>>>>>>>>>    2. *Engine level conversion* - Engines would convert
>>>>>>>>>    PositionDelete objects to their native types.
>>>>>>>>>
>>>>>>>>> In code:
>>>>>>>>>
>>>>>>>>>    - In the 1st proposal, the FormatModel.writeBuilder(OutputFile
>>>>>>>>>    outputFile) can return anything:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     WriteBuilder builder =
>>>>>>>>>         FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>>>>>     FileAppender<InternalRow> appender = builder
>>>>>>>>>         .schema(table.schema())
>>>>>>>>>         .content(FileContent.DATA)
>>>>>>>>>         ....
>>>>>>>>>         .build();
>>>>>>>>>
>>>>>>>>>     // Exposed, but FormatModelRegistry.positionDeleteWriteBuilder
>>>>>>>>>     // should be used instead
>>>>>>>>>     WriteBuilder builder =
>>>>>>>>>         FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>>>>>     FileAppender<PositionDelete<InternalRow>> appender = builder
>>>>>>>>>         .schema(table.schema())
>>>>>>>>>         .content(FileContent.POSITION_DELETES)
>>>>>>>>>         ....
>>>>>>>>>         .build();
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - In the 2nd proposal, the FormatModel needs another method:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
>>>>>>>>>
>>>>>>>>> Example implementation:
>>>>>>>>>
>>>>>>>>>     return delete -> {
>>>>>>>>>       deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
>>>>>>>>>       deleteRecord.update(1, delete.pos());
>>>>>>>>>       deleteRecord.update(2, delete.row());
>>>>>>>>>       return deleteRecord;
>>>>>>>>>     };
>>>>>>>>>
>>>>>>>>>     // Content is only used for writer property configuration
>>>>>>>>>     WriteBuilder<InternalRow> builder = sparkFormatModel.writeBuilder(outputFile);
>>>>>>>>>     FileAppender<InternalRow> appender = builder
>>>>>>>>>         .schema(table.schema())
>>>>>>>>>         .content(FileContent.DATA)
>>>>>>>>>         ....
>>>>>>>>>         .build();
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Drawbacks
>>>>>>>>>
>>>>>>>>>    - Proposal 1:
>>>>>>>>>       - Type checking for the FileAppenders occurs only at
>>>>>>>>>       runtime, so user errors surface late.
>>>>>>>>>       - File Format specification must clearly specify which
>>>>>>>>>       builder type corresponds to which file content 
>>>>>>>>> parameter—generics would
>>>>>>>>>       offer better clarity.
>>>>>>>>>       - Inconsistent patterns between WriteBuilder and
>>>>>>>>>       ReadBuilder, as the latter can define output types via generics.
>>>>>>>>>    - Proposal 2:
>>>>>>>>>       - Requires FormatModels to implement a converter method to
>>>>>>>>>       transform PositionDelete<InternalRow> into InternalRow.
>>>>>>>>>
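>>>>>>>>> To make the first drawback concrete (a hedged illustration that mirrors
>>>>>>>>> the snippets above): the unparameterized WriteBuilder compiles regardless
>>>>>>>>> of whether the appender type matches the content type, so a mismatch only
>>>>>>>>> surfaces when rows are written.
>>>>>>>>>
>>>>>>>>>     // Compiles fine, but the writer actually expects PositionDelete rows:
>>>>>>>>>     FileAppender<InternalRow> appender =
>>>>>>>>>         FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile)
>>>>>>>>>             .schema(table.schema())
>>>>>>>>>             .content(FileContent.POSITION_DELETES)
>>>>>>>>>             .build();
>>>>>>>>>     appender.add(row);  // fails only here, at runtime
>>>>>>>>>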
>>>>>>>>> Since writing position delete files is deprecated in the V3 spec,
>>>>>>>>> this extra method in the 2nd proposal will be deprecated too. As a result,
>>>>>>>>> in the long run, we will have a nice, clean API.
>>>>>>>>> OTOH, if we accept the compromise described in the 1st proposal, the
>>>>>>>>> consequences of our decision will remain even after the deprecated
>>>>>>>>> functions are removed.
>>>>>>>>>
>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>> Thanks, Peter
>>>>>>>>>
>>>>>>>>> On Thu, Aug 14, 2025, 14:12 Péter Váry <
>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Team,
>>>>>>>>>>
>>>>>>>>>> During yesterday’s community sync, we discussed the current state
>>>>>>>>>> of the File Format API proposal and identified two key questions that
>>>>>>>>>> require input from the broader community:
>>>>>>>>>>
>>>>>>>>>> *1. Dropping support for Position Delete files with Row Data*
>>>>>>>>>>
>>>>>>>>>> The current Iceberg V2 spec [1] defines two types of position
>>>>>>>>>> delete files:
>>>>>>>>>>
>>>>>>>>>>    - Files that store only the file name and row position.
>>>>>>>>>>    - Files that also store the deleted row data.
>>>>>>>>>>
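>>>>>>>>>> For reference, the V2 position delete schema from the spec [1] is:
>>>>>>>>>>
>>>>>>>>>>     file_path: string  (field id 2147483546) - path of the target data file
>>>>>>>>>>     pos: long          (field id 2147483545) - position of the deleted row
>>>>>>>>>>     row: struct<...>   (field id 2147483544) - optional, the deleted row values
>>>>>>>>>>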
>>>>>>>>>> Although this feature is defined in the spec and some tests exist
>>>>>>>>>> in the Iceberg codebase, we’re not aware of any actual 
>>>>>>>>>> implementation using
>>>>>>>>>> the second type (with row data). Supporting V2 table writing via the 
>>>>>>>>>> new
>>>>>>>>>> File Format API would be simpler if we dropped support for this 
>>>>>>>>>> feature.
>>>>>>>>>> If you know of any use case or reason to retain support for
>>>>>>>>>> position deletes with row data, please let us know.
>>>>>>>>>>
>>>>>>>>>> *2. Deprecating Native File Format Readers/Writers in the API*
>>>>>>>>>>
>>>>>>>>>> The current API contains format-specific readers/writers for
>>>>>>>>>> Parquet, Avro, and ORC. With the introduction of the InternalData 
>>>>>>>>>> and File
>>>>>>>>>> Format APIs, Iceberg users can now write files using:
>>>>>>>>>>
>>>>>>>>>>    - InternalData API for metadata files (manifest, manifest
>>>>>>>>>>    list, partition stats).
>>>>>>>>>>    - File Format API for data and delete files.
>>>>>>>>>>
>>>>>>>>>> I propose we deprecate the original format-specific writers and
>>>>>>>>>> guide users to use the new APIs based on the target file type. If 
>>>>>>>>>> you’re
>>>>>>>>>> aware of any use cases that still require the original 
>>>>>>>>>> format-specific
>>>>>>>>>> writers, please share them.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> [1] - Position Delete File Spec:
>>>>>>>>>> https://iceberg.apache.org/spec/?h=delete#position-delete-files
>>>>>>>>>>
>>>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont:
>>>>>>>>>> 2025. júl. 22., K, 16:09):
>>>>>>>>>>
>>>>>>>>>>> I also put together a solution where the engine-specific format
>>>>>>>>>>> transformation is separated from the writer, and the engines need 
>>>>>>>>>>> to take
>>>>>>>>>>> care of it separately.
>>>>>>>>>>> This is somewhat complicated on the implementation side (see:
>>>>>>>>>>> [RowDataTransformer](
>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08),
>>>>>>>>>>> and [InternalRowTransformer](
>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)),
>>>>>>>>>>> but simplifies the API.
>>>>>>>>>>>
>>>>>>>>>>> @rdblue: Please check the proposed solution. I think this is
>>>>>>>>>>> what you suggested.
>>>>>>>>>>>
>>>>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont:
>>>>>>>>>>> 2025. jún. 30., H, 18:42):
>>>>>>>>>>>
>>>>>>>>>>>> During the PR review [1], we began exploring what we could use
>>>>>>>>>>>> as an intermediate layer to reduce the need for engines and file 
>>>>>>>>>>>> formats to
>>>>>>>>>>>> implement the full matrix of file format - object model 
>>>>>>>>>>>> conversions.
>>>>>>>>>>>>
>>>>>>>>>>>> To support this discussion, I’ve created and run a set of
>>>>>>>>>>>> performance benchmarks and compiled a document outlining the 
>>>>>>>>>>>> potential
>>>>>>>>>>>> benefits and trade-offs [2].
>>>>>>>>>>>>
>>>>>>>>>>>> Feedback is welcome; feel free to comment on the document, the
>>>>>>>>>>>> PR, or directly in this thread.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Peter
>>>>>>>>>>>>
>>>>>>>>>>>> [1] - PR discussion -
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
>>>>>>>>>>>> [2] - File Format and engine object model transformation
>>>>>>>>>>>> performance -
>>>>>>>>>>>> https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4
>>>>>>>>>>>>
>>>>>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont:
>>>>>>>>>>>> 2025. máj. 7., Sze, 13:15):
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>> The proposed API part is reviewed and ready to go. See:
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>> Thanks to everyone who reviewed it already!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Many of you wanted to review, but I know that the time
>>>>>>>>>>>>> constraints are there for everyone. I still very much would like 
>>>>>>>>>>>>> to hear
>>>>>>>>>>>>> your voices, so I will not merge the PR this week. Please review 
>>>>>>>>>>>>> it if you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Péter Váry <peter.vary.apa...@gmail.com> ezt írta (időpont:
>>>>>>>>>>>>> 2025. ápr. 16., Sze, 7:02):
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Renjie,
>>>>>>>>>>>>>> The first one for the proposed new API is here:
>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <
>>>>>>>>>>>>>> liurenjie2...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, Peter:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the effort. I totally agree with splitting them
>>>>>>>>>>>>>>> into smaller PRs to move forward.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm quite interested in this topic, and please ping me in
>>>>>>>>>>>>>>> those split PRs, and I'll help review them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <
>>>>>>>>>>>>>>> j...@nanthrax.net> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Peter
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Awesome ! Thank you so much !
>>>>>>>>>>>>>>>> I will do a new pass.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <
>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Hi JB,
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Separated out the proposed interfaces to a new PR:
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12774.
>>>>>>>>>>>>>>>> > Reviewers can check that out if they are only interested
>>>>>>>>>>>>>>>> in how the new API would look.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>>> > Peter
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Jean-Baptiste Onofré <j...@nanthrax.net> ezt írta
>>>>>>>>>>>>>>>> (időpont: 2025. ápr. 10., Cs, 18:25):
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Hi Peter
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Thanks for the ping about the PR.
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Maybe, to facilitate the review and move forward faster,
>>>>>>>>>>>>>>>> we should
>>>>>>>>>>>>>>>> >> split the PR into smaller PRs:
>>>>>>>>>>>>>>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder,
>>>>>>>>>>>>>>>> ObjectModel,
>>>>>>>>>>>>>>>> >> AppenderBuilder, DataWriterBuilder, ...)
>>>>>>>>>>>>>>>> >> - one for each file provider (Parquet, Avro, ORC)
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Thoughts ? I can help on the split if needed.
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Regards
>>>>>>>>>>>>>>>> >> JB
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <
>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Since the 1.9.0 release candidate has been created, I
>>>>>>>>>>>>>>>> would like to resurrect this PR:
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298 to ensure
>>>>>>>>>>>>>>>> that we have as long a testing period as possible for it.
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > To recap, here is what the PR does after the review
>>>>>>>>>>>>>>>> rounds:
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Created 3 interface classes which are implemented by
>>>>>>>>>>>>>>>> the file formats:
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > ReadBuilder - Builder for reading data from data files
>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing data to data
>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>> >> > ObjectModel - Providing ReadBuilders, and
>>>>>>>>>>>>>>>> AppenderBuilders for the specific data file format and object 
>>>>>>>>>>>>>>>> model pair
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Updated the Parquet, Avro, and ORC implementations for
>>>>>>>>>>>>>>>> these interfaces, and deprecated the old reader/writer APIs
>>>>>>>>>>>>>>>> >> > Created interface classes which will be used by the
>>>>>>>>>>>>>>>> actual readers/writers of the data files:
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing a file
>>>>>>>>>>>>>>>> >> > DataWriterBuilder - Builder for generating a data file
>>>>>>>>>>>>>>>> >> > PositionDeleteWriterBuilder - Builder for generating a
>>>>>>>>>>>>>>>> position delete file
>>>>>>>>>>>>>>>> >> > EqualityDeleteWriterBuilder - Builder for generating
>>>>>>>>>>>>>>>> an equality delete file
>>>>>>>>>>>>>>>> >> > No ReadBuilder here - the file format reader builder
>>>>>>>>>>>>>>>> is reused
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Created a WriterBuilder class which implements the
>>>>>>>>>>>>>>>> interfaces above
>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>>>>>>>>>>>>>>>> based on a provided file format specific AppenderBuilder
>>>>>>>>>>>>>>>> >> > Created an ObjectModelRegistry which stores the
>>>>>>>>>>>>>>>> available ObjectModels, and engines and users could request 
>>>>>>>>>>>>>>>> the readers
>>>>>>>>>>>>>>>> (ReadBuilder) and writers
>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>>>>>>>>>>>>>>>> from.
>>>>>>>>>>>>>>>> >> > Created the appropriate ObjectModels:
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > GenericObjectModels - for reading and writing Iceberg
>>>>>>>>>>>>>>>> Records
>>>>>>>>>>>>>>>> >> > SparkObjectModels - for reading (vectorized and
>>>>>>>>>>>>>>>> non-vectorized) and writing Spark InternalRow/ColumnarBatch 
>>>>>>>>>>>>>>>> objects
>>>>>>>>>>>>>>>> >> > FlinkObjectModels - for reading and writing Flink
>>>>>>>>>>>>>>>> RowData objects
>>>>>>>>>>>>>>>> >> > An arrow object model is also registered for
>>>>>>>>>>>>>>>> vectorized reads of Parquet files into Arrow ColumnarBatch 
>>>>>>>>>>>>>>>> objects
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Updated the production code where the reading and
>>>>>>>>>>>>>>>> writing happens to use the ObjectModelRegistry and the new 
>>>>>>>>>>>>>>>> reader/writer
>>>>>>>>>>>>>>>> interfaces to access data files
>>>>>>>>>>>>>>>> >> > Kept the testing code intact to ensure that the new
>>>>>>>>>>>>>>>> API/code is not breaking anything
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > The original change was not small, and grew
>>>>>>>>>>>>>>>> substantially during the review rounds. So if you have 
>>>>>>>>>>>>>>>> questions, or I can
>>>>>>>>>>>>>>>> do anything to make the review easier, don't hesitate to ask. 
>>>>>>>>>>>>>>>> I am happy to
>>>>>>>>>>>>>>>> do anything to move this forward.
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Thanks,
>>>>>>>>>>>>>>>> >> > Peter
>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>> >> > Péter Váry <peter.vary.apa...@gmail.com> ezt írta
>>>>>>>>>>>>>>>> (időpont: 2025. márc. 26., Sze, 14:54):
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> Hi everyone,
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> I have updated the File Format API PR (
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298) based on the
>>>>>>>>>>>>>>>> answers and review comments.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> I would like to merge this only after the 1.9.0
>>>>>>>>>>>>>>>> release so we have more time finding any issues and solving 
>>>>>>>>>>>>>>>> them before
>>>>>>>>>>>>>>>> this goes to a release for the users.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> For this I have updated the deprecation comments
>>>>>>>>>>>>>>>> accordingly.
>>>>>>>>>>>>>>>> >> >> I would like to ask you to review the PR, so we iron
>>>>>>>>>>>>>>>> out any possible requested changes and be ready for the merge 
>>>>>>>>>>>>>>>> as soon as
>>>>>>>>>>>>>>>> possible after the 1.9.0 release.
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> Thanks,
>>>>>>>>>>>>>>>> >> >> Peter
>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>> >> >> Péter Váry <peter.vary.apa...@gmail.com> ezt írta
>>>>>>>>>>>>>>>> (időpont: 2025. márc. 21., P, 14:32):
>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>> >> >>> Hi Renjie,
>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>> >> >>> > 1. File format filters
>>>>>>>>>>>>>>>> >> >>> >
>>>>>>>>>>>>>>>> >> >>> > Do the filters include filter expressions
>>>>>>>>>>>>>>>> from both the user query and the delete filter?
>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>> >> >>> The current discussion is about the filters from the
>>>>>>>>>>>>>>>> user query.
>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>> >> >>> About the delete filter:
>>>>>>>>>>>>>>>> >> >>> Based on the suggestions on the PR, I have moved the
>>>>>>>>>>>>>>>> delete filter out from the main API. Created a 
>>>>>>>>>>>>>>>> `SupportsDeleteFilter`
>>>>>>>>>>>>>>>> interface for it which would allow pushing down to the filter 
>>>>>>>>>>>>>>>> to Parquet
>>>>>>>>>>>>>>>> vectorized readers in Spark, as this is the only place where 
>>>>>>>>>>>>>>>> we currently
>>>>>>>>>>>>>>>> implemented this feature.
>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>> >> >>> Renjie Liu <liurenjie2...@gmail.com> ezt írta
>>>>>>>>>>>>>>>> (időpont: 2025. márc. 21., P, 14:11):
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> Hi, Peter:
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> Thanks for the effort on this.
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> 1. File format filters
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> Do the filters include filter expressions from
>>>>>>>>>>>>>>>> both the user query and the delete filter?
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> For filters from user query, I agree with you that
>>>>>>>>>>>>>>>> we should keep the current behavior.
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> For delete filters associated with data files, at first
>>>>>>>>>>>>>>>> I thought file format readers should not care about this. But now I realize
>>>>>>>>>>>>>>>> that maybe we need to push them to the file reader as well; this is useful
>>>>>>>>>>>>>>>> when the `IS_DELETED` metadata column is not necessary and we could use
>>>>>>>>>>>>>>>> these filters (position deletes, etc.) to further prune data.
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> But anyway, I agree that we could postpone it to a
>>>>>>>>>>>>>>>> follow-up PR.
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> 2. Batch size configuration
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> I'm leaning toward option 2.
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> 3. Spark configuration
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> I'm leaning towards using different configuration
>>>>>>>>>>>>>>>> objects.
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <
>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> Hi Team,
>>>>>>>>>>>>>>>> >> >>>>> Thanks everyone for the reviews on
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298!
>>>>>>>>>>>>>>>> >> >>>>> I have addressed most of the comments, but a few
>>>>>>>>>>>>>>>> questions still remain which might merit a bit wider audience:
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> We should decide on the expected filtering
>>>>>>>>>>>>>>>> behavior when the filters are pushed down to the readers. 
>>>>>>>>>>>>>>>> Currently the
>>>>>>>>>>>>>>>> filters are applied as best effort for the file format 
>>>>>>>>>>>>>>>> readers. Some
>>>>>>>>>>>>>>>> readers (Avro) just skip them altogether. There was a 
>>>>>>>>>>>>>>>> suggestion on the PR
>>>>>>>>>>>>>>>> that we might enforce more strict requirements and the readers 
>>>>>>>>>>>>>>>> either
>>>>>>>>>>>>>>>> reject part of the filters, or they could apply them fully.
>>>>>>>>>>>>>>>> >> >>>>> Batch sizes are currently parameters for the
>>>>>>>>>>>>>>>> reader builders which could be set for non-vectorized readers 
>>>>>>>>>>>>>>>> too which
>>>>>>>>>>>>>>>> could be confusing.
>>>>>>>>>>>>>>>> >> >>>>> Currently the Spark batch reader uses different
>>>>>>>>>>>>>>>> configuration objects for ParquetBatchReadConf and 
>>>>>>>>>>>>>>>> OrcBatchReadConf as
>>>>>>>>>>>>>>>> requested by the reviewers of the Comet PR. There was a 
>>>>>>>>>>>>>>>> suggestion on the
>>>>>>>>>>>>>>>> current PR to use a common configuration instead.
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> I would be interested in hearing your thoughts
>>>>>>>>>>>>>>>> about these topics.
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> My current take:
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> File format filters: I am leaning towards keeping the
>>>>>>>>>>>>>>>> current lenient behavior, especially since Bloom filters are not able to do
>>>>>>>>>>>>>>>> full filtering and are often used as a way to filter out unwanted records.
>>>>>>>>>>>>>>>> Another option would be to implement secondary filtering inside the file
>>>>>>>>>>>>>>>> formats themselves, which I think would cause extra complexity and possible
>>>>>>>>>>>>>>>> code duplication. Whatever the decision here, I would suggest moving this
>>>>>>>>>>>>>>>> out to a follow-up PR, as the current changeset is big enough as it is.
>>>>>>>>>>>>>>>> >> >>>>> Batch size configuration: Currently this is the only
>>>>>>>>>>>>>>>> property which differs between the batch readers and the non-vectorized
>>>>>>>>>>>>>>>> readers. I see 3 possible solutions:
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> Create different builders for vectorized and
>>>>>>>>>>>>>>>> non-vectorized reads - I don't think the current solution is confusing
>>>>>>>>>>>>>>>> enough to be worth the extra class
>>>>>>>>>>>>>>>> >> >>>>> We could put this into the reader configuration property
>>>>>>>>>>>>>>>> set - This could work, but would "hide" a possible configuration mode which
>>>>>>>>>>>>>>>> is valid for both Parquet and ORC readers
>>>>>>>>>>>>>>>> >> >>>>> We could keep things as they are now - I would choose
>>>>>>>>>>>>>>>> this one, but I don't have a strong opinion here
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and
>>>>>>>>>>>>>>>> happy to move in the direction the community decides on
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> Thanks,
>>>>>>>>>>>>>>>> >> >>>>> Peter
>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> ezt írta
>>>>>>>>>>>>>>>> (időpont: 2025. márc. 14., P, 16:31):
>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>> >> >>>>>> Hi Peter
>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>> >> >>>>>> Thanks for the update. I will do a new pass on
>>>>>>>>>>>>>>>> the PR.
>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>> >> >>>>>> Regards
>>>>>>>>>>>>>>>> >> >>>>>> JB
>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <
>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>> >> >>>>>> > Hi Team,
>>>>>>>>>>>>>>>> >> >>>>>> > I have rebased the File Format API proposal (
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298) to include
>>>>>>>>>>>>>>>> the new changes needed for the Variant types. I would love to 
>>>>>>>>>>>>>>>> hear your
>>>>>>>>>>>>>>>> feedback, especially Dan and Ryan, as you were the most active 
>>>>>>>>>>>>>>>> during our
>>>>>>>>>>>>>>>> discussions. If I can help in any way to make the review 
>>>>>>>>>>>>>>>> easier, please let
>>>>>>>>>>>>>>>> me know.
>>>>>>>>>>>>>>>> >> >>>>>> > Thanks,
>>>>>>>>>>>>>>>> >> >>>>>> > Peter
>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> ezt
>>>>>>>>>>>>>>>> írta (időpont: 2025. febr. 28., P, 17:50):
>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>> >> >>>>>> >> Hi everyone,
>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks for all of the actionable, relevant
>>>>>>>>>>>>>>>> feedback on the PR (
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298).
>>>>>>>>>>>>>>>> >> >>>>>> >> Updated the code to address most of them.
>>>>>>>>>>>>>>>> Please check if you agree with the general approach.
>>>>>>>>>>>>>>>> >> >>>>>> >> If there is a consensus about the general
>>>>>>>>>>>>>>>> approach, I could separate the PR into smaller pieces so we can have an
>>>>>>>>>>>>>>>> easier time reviewing and merging those step-by-step.
>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks,
>>>>>>>>>>>>>>>> >> >>>>>> >> Peter
>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> ezt
>>>>>>>>>>>>>>>> írta (időpont: 2025. febr. 20., Cs, 14:14):
>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>> >> >>>>>> >>> Hi Peter
>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>> >> >>>>>> >>> sorry for the late reply on this.
>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>> >> >>>>>> >>> I did a pass on the proposal; it's very
>>>>>>>>>>>>>>>> interesting and well written.
>>>>>>>>>>>>>>>> >> >>>>>> >>> I like the DataFile API, and it is definitely worth
>>>>>>>>>>>>>>>> discussing all together.
>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to
>>>>>>>>>>>>>>>> discuss the DataFile API?
>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>> >> >>>>>> >>> Thoughts ?
>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>> >> >>>>>> >>> Regards
>>>>>>>>>>>>>>>> >> >>>>>> >>> JB
>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>> >> >>>>>> >>> > Hi Team,
>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>> >> >>>>>> >>> > As mentioned earlier on our Community Sync
>>>>>>>>>>>>>>>> I am exploring the possibility to define a FileFormat API for 
>>>>>>>>>>>>>>>> accessing
>>>>>>>>>>>>>>>> different file formats. I have put together a proposal based 
>>>>>>>>>>>>>>>> on my findings.
>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different file
>>>>>>>>>>>>>>>> formats: Avro, Parquet, and ORC. With the introduction of the Iceberg V3
>>>>>>>>>>>>>>>> specification, many new features are being added to Iceberg. Some of these
>>>>>>>>>>>>>>>> features, like new column types and default values, require changes at the
>>>>>>>>>>>>>>>> file format level. The changes are added by individual developers with a
>>>>>>>>>>>>>>>> different focus on the different file formats. As a result, not all of the
>>>>>>>>>>>>>>>> features are available for every supported file format.
>>>>>>>>>>>>>>>> >> >>>>>> >>> > Also, there are emerging file formats like
>>>>>>>>>>>>>>>> Vortex [1] or Lance [2] which, either by specialization or by applying
>>>>>>>>>>>>>>>> newer research results, could provide better alternatives for certain
>>>>>>>>>>>>>>>> use-cases like random access for data, or storing ML models.
>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and
>>>>>>>>>>>>>>>> the google document [4], and comment there or reply on the dev 
>>>>>>>>>>>>>>>> list if you
>>>>>>>>>>>>>>>> have any suggestions.
>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>> >> >>>>>> >>> > Thanks,
>>>>>>>>>>>>>>>> >> >>>>>> >>> > Peter
>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>>>>>>>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>>>>>>>>>>>>> >> >>>>>> >>> > [3] -
>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/12225
>>>>>>>>>>>>>>>> >> >>>>>> >>> > [4] -
>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
