Re: Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-05-13 Thread Péter Váry
Hi Qian,
What are the main blockers for you to move to V3?
Many engines already support V3, and migration should be straightforward.
Thanks,
Peter

Qian Su wrote (on Wed, May 13, 2026, 0:57):

> Thanks for driving this proposal forward! Our organization is still
> largely running on V2 tables. Because migrating our entire footprint to V3
> will take some time, having a V2-compatible conversion path would be
> incredibly valuable for our use cases. I would be very interested in
> contributing to the V2 path to help support this. Please let me know where
> it makes the most sense for me to jump in—whether collaborating directly on
> the current PR or picking it up as a follow-up once the initial V3
> foundation lands.
>
> Qian

RE: Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-05-12 Thread Qian Su
Thanks for driving this proposal forward! Our organization is still largely 
running on V2 tables. Because migrating our entire footprint to V3 will take 
some time, having a V2-compatible conversion path would be incredibly valuable 
for our use cases. I would be very interested in contributing to the V2 path to 
help support this. Please let me know where it makes the most sense for me to 
jump in—whether collaborating directly on the current PR or picking it up as a 
follow-up once the initial V3 foundation lands.

Qian

On 2026/04/17 13:58:28 Maximilian Michels wrote:
> Hi Steven,
> 
> Thanks for chiming in! Yes, the current solution requires V3 tables.
> I'll probably have to make that clearer in the PR description.
> 
> In principle, it wouldn't be hard to make it work with V2 either,
> obviously with the tradeoffs that come along with it in terms of
> storage / lookup efficiency.
> 
> Cheers,
> Max

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-04-17 Thread Maximilian Michels
Hi Steven,

Thanks for chiming in! Yes, the current solution requires V3 tables.
I'll probably have to make that clearer in the PR description.

In principle, it wouldn't be hard to make it work with V2 either,
obviously with the tradeoffs that come along with it in terms of
storage / lookup efficiency.
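
To illustrate the tradeoff, here is a minimal Python sketch, not code from the PR. V2 position deletes are stored as (file_path, pos) rows, while V3 deletion vectors keep a bitmap of deleted positions per data file. The sketch uses a plain Python int as a stand-in for the bitmap, and all function names are hypothetical.

```python
def as_position_deletes(path, positions):
    """V2-style: one (file_path, pos) row per deleted row, sorted by position."""
    return [(path, pos) for pos in sorted(positions)]


def as_bitmap(positions):
    """V3-style stand-in: one bit per deleted position of a single data file."""
    bitmap = 0
    for pos in positions:
        bitmap |= 1 << pos
    return bitmap


def is_deleted(bitmap, pos):
    """Lookup is a single bit test instead of a search through delete rows."""
    return (bitmap >> pos) & 1 == 1


deleted = {0, 5, 7}
rows = as_position_deletes("a.parquet", deleted)   # repeats the file path per row
bitmap = as_bitmap(deleted)                        # path is stored once, elsewhere
```

The row-based form repeats the file path for every deleted row and needs a search or a scan to answer "is position N deleted?", which is where the storage and lookup cost of a V2 path would come from.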

Cheers,
Max

On Thu, Apr 16, 2026 at 5:30 PM Steven Wu  wrote:
>
> Sorry for the delayed response. I agree that this proposal could get us one 
> step closer to removing equality deletes. It can also verify whether the 
> index solution can scale for streaming CDC/Upsert writes.
>
> > For this PR, we opted to use deletion vectors over regular delete files due 
> > to their efficiency in terms of space and lookup.
>
> I didn't quite get this from the PR description. Does this mean it is limited 
> to V3 tables?
>
> Thanks for working on this, Max!

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-04-16 Thread Steven Wu
Sorry for the delayed response. I agree that this proposal could get us one
step closer to removing equality deletes. It can also verify whether the
index solution can scale for streaming CDC/Upsert writes.

> For this PR, we opted to use deletion vectors over regular delete files
> due to their efficiency in terms of space and lookup.

I didn't quite get this from the PR description. Does this mean it is
limited to V3 tables?

Thanks for working on this, Max!

On Thu, Apr 16, 2026 at 8:03 AM Maximilian Michels  wrote:

> Thanks everyone for the discussion and support.
>
> I've opened a PR which implements what we discussed here:
> https://github.com/apache/iceberg/pull/15996
>
> Just to recap:
>
> The EqualityDeleteConverter (EDC) will be running in-line with the
> writer, which produces the equality deletes to a staging branch. We
> monitor the staging branch for new commits, build a sharded primary
> key index in Flink state (backed by RocksDB for large tables), resolve
> equality deletes against the index, and commit back the resulting data
> files + DVs to the main branch.
>
> The PR is split into several commits, which we can break out into
> separate PRs for easier review. There are some limitations and
> follow-ups listed in the PR. The biggest gaps are preserving row
> lineage and lifecycle management of the staging branch. In a
> follow-up, we will also add integration with the Flink IcebergSink. I
> tried to keep the scope limited to the EDC maintenance task for the
> first PR.
>
> Cheers,
> Max

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-04-16 Thread Maximilian Michels
Thanks everyone for the discussion and support.

I've opened a PR which implements what we discussed here:
https://github.com/apache/iceberg/pull/15996

Just to recap:

The EqualityDeleteConverter (EDC) will be running in-line with the
writer, which produces the equality deletes to a staging branch. We
monitor the staging branch for new commits, build a sharded primary
key index in Flink state (backed by RocksDB for large tables), resolve
equality deletes against the index, and commit back the resulting data
files + DVs to the main branch.
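
For illustration, the flow above could be sketched roughly as follows. This is a minimal Python sketch under simplifying assumptions, not the code from the PR: the `Commit` shape, the class name, and the use of plain dicts in place of Flink state are all hypothetical, and each primary key is assumed to appear at most once per commit.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One staging-branch commit (hypothetical shape for this sketch)."""
    data_files: dict[str, list[str]]  # file path -> primary keys, in row order
    equality_deletes: list[str] = field(default_factory=list)  # deleted keys


class EqualityDeleteConverterSketch:
    def __init__(self, num_shards: int = 4):
        # Sharded primary key index: key -> (file path, row position).
        # The proposal keeps this in Flink state; plain dicts stand in here.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, key: str) -> dict:
        return self.shards[hash(key) % len(self.shards)]

    def convert(self, commit: Commit) -> dict[str, set[int]]:
        """Resolve a commit's equality deletes into deleted positions per file."""
        dvs: dict[str, set[int]] = {}
        # Resolve deletes first: they target rows written by earlier commits.
        for key in commit.equality_deletes:
            location = self._shard(key).pop(key, None)
            if location is not None:
                path, pos = location
                dvs.setdefault(path, set()).add(pos)
        # Then index the new rows so that later deletes can find them.
        for path, keys in commit.data_files.items():
            for pos, key in enumerate(keys):
                self._shard(key)[key] = (path, pos)
        return dvs


edc = EqualityDeleteConverterSketch()
edc.convert(Commit({"a.parquet": ["k1", "k2"]}))
# An upsert of k2: an equality delete plus a new row in b.parquet.
dvs = edc.convert(Commit({"b.parquet": ["k2"]}, ["k2"]))
```

In this sketch, `dvs` names position 1 of `a.parquet`, which is what would be committed to the main branch as a DV alongside the new data file.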

The PR is split into several commits, which we can break out into
separate PRs for easier review. There are some limitations and
follow-ups listed in the PR. The biggest gaps are preserving row
lineage and lifecycle management of the staging branch. In a
follow-up, we will also add integration with the Flink IcebergSink. I
tried to keep the scope limited to the EDC maintenance task for the
first PR.

Cheers,
Max

On Thu, Apr 2, 2026 at 9:43 AM Márton Balassi  wrote:
>
> Thanks for raising this, Max and for the feedback Peter and Manu.
>
> I am supportive of this proposal, especially with the clearly defined vision 
> of eventually completely removing the need for equality deletes.
>
> Lifting the reliance on equality deletes in the Flink write path would be a 
> significant improvement, both in terms of read efficiency (by moving towards 
> delete vectors) and in terms of new capabilities, as it would make writing 
> upserts from Flink a viable path to explore going forward.
>
> Cheers,
> Marton

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-04-02 Thread Márton Balassi
Thanks for raising this, Max, and for the feedback, Peter and Manu.

I am supportive of this proposal, especially with the clearly defined vision of 
eventually completely removing the need for equality deletes. 

Removing the reliance on equality deletes in the Flink write path would be a
significant improvement, both in terms of read efficiency (by moving towards
deletion vectors) and in terms of new capabilities, as it would make writing
upserts from Flink a viable path to explore going forward.

Cheers,
Marton

On 2026/03/20 11:15:26 Maximilian Michels wrote:
> Thanks Peter and Manu for the feedback.
> 
> @Peter:
> 
> Good point on the end goal. The end goal should be to completely
> remove equality deletes.
> 
> While the staging branch, which contains the equality deletes, is an
> internal implementation detail of the Flink writer, it will still be
> accessible via the Iceberg reader API. For the transition period, I
> think this has several advantages:
> 1. We don't need to fundamentally change the write logic of existing writers.
> 2. We still allow for the data to be inspected before converting it
> and merging it to the main branch. This is also helpful for
> troubleshooting.
> 
> The staging branch solution is a first step towards removing equality deletes.
> 
> In V4, we could already deprecate equality deletes. Once the spec
> includes indices, we can move the index into Iceberg, which should
> make it easier to develop an in-place resolution of equality deletes
> supporting multiple writers and conflict resolution. Admittedly, we
> haven't fully figured out the best in-place approach. I think it is a
> good idea to take it one step at a time.
> 
> On row lineage: If we want to preserve the row id of updated rows, we
> will have to store the row id in the primary key index. Theoretically,
> we should be able to then add it to the corresponding new row. The
> question is how to do that efficiently, such that we don't have to
> rewrite any data files. We would need some way to map the row id of
> the newly inserted row to the row id of the deleted row. Do we already
> have such functionality in Iceberg?
> 
> On concurrent writes: For the time being, I think we should not allow
> concurrent maintenance tasks, including equality delete conversion.
> Concurrent writes are still supported, as long as they go to the
> staging branch.
> 
> @Manu:
> 
> +1 to Peter's response. The primary key index is bounded and
> independent of the number of accumulated equality deletes, so memory
> doesn't blow up, as long as we have sufficient resources to load the
> index. We definitely cannot rely on the full index to fit into memory.
> Fortunately, Flink is already prepared for this; it supports spilling
> to disk via its RocksDB state backend.
> 
> Cheers,
> Max
> 
> On Fri, Mar 20, 2026 at 11:07 AM Péter Váry wrote:
> >
> > Equality delete resolution could be made significantly more efficient by 
> > using an index (e.g., backed by RocksDB) to store the current mapping from 
> > primary keys to (file, position). While the memory footprint would not be 
> > small, it would be bounded and independent of the number of accumulated 
> > equality deletes. In addition, even a blocking compaction should block for 
> > a shorter period than the typical interval at which table compactions are 
> > scheduled.
> >
> > Manu Zhang wrote (on Thu, Mar 19, 2026, 15:57):
> >>
> >> Thanks Max for the proposal. One question here.
> >> When the convert task cannot finish in time (e.g. blocked by
> >> compactions), and equality deletes accumulate on the staging branch, will 
> >> we have the same issue as loading too many equality deletes and blowing up 
> >> memory?
> >>
> >> Regards,
> >> Manu
> >>
> >> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry wrote:
> >>>
> >>> Thanks, Max, for continuing to push this forward.
> >>>
> >>> The proposal feels like a step in the right direction, but I would like 
> >>> to see a clearer view of the end goal. As it stands, equality deletes 
> >>> remain in the spec because the changes are committed to an intermediate 
> >>> branch. Since the long‑term objective is to remove equality deletes from 
> >>> the specification altogether, we should be clear about the final solution 
> >>> that achieves this.
> >>>
> >>> Flink writes will also continue to have the limitation that row lineage 
> >>> is not maintained correctly. This is unchanged from the current 
> >>> situation, but I think it’s important to explicitly call this out, or 
> >>> ideally, explore whether there’s a way to address it.
> >>>
> >>> In addition, concurrent writes and compactions would require updating the 
> >>> primary key index, which could be expensive.
> >>>
> >>> That said, I don’t see a clearly better alternative at the moment, and 
> >>> overall this seems like a reasonable way forward.
> >>>
> >>> Thanks again for continuing to drive the proposal.
> >>> Peter
> >>>
> >>>
> >>> On Wed, Mar 18, 20

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-03-20 Thread Maximilian Michels
Thanks Peter and Manu for the feedback.

@Peter:

Good point on the end goal. The end goal should be to completely
remove equality deletes.

While the staging branch, which contains the equality deletes, is an
internal implementation detail of the Flink writer, it will still be
accessible via the Iceberg reader API. For the transition period, I
think this has several advantages:
1. We don't need to fundamentally change the write logic of existing writers.
2. We still allow for the data to be inspected before converting it
and merging it to the main branch. This is also helpful for
troubleshooting.

The staging branch solution is a first step towards removing equality deletes.

In V4, we could already deprecate equality deletes. Once the spec
includes indices, we can move the index into Iceberg, which should
make it easier to develop an in-place resolution of equality deletes
supporting multiple writers and conflict resolution. Admittedly, we
haven't fully figured out the best in-place approach. I think it is a
good idea to take it one step at a time.

On row lineage: If we want to preserve the row id of updated rows, we
will have to store the row id in the primary key index. Theoretically,
we should be able to then add it to the corresponding new row. The
question is how to do that efficiently, such that we don't have to
rewrite any data files. We would need some way to map the row id of
the newly inserted row to the row id of the deleted row. Do we already
have such functionality in Iceberg?
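
A minimal sketch of the bookkeeping side of this idea, assuming the
primary key index keeps a row id next to each (file, position) entry.
The names `IndexEntry` and `PrimaryKeyIndex` are hypothetical and not
taken from the PR or from Iceberg. The sketch only shows that the index
can hand the old row id to the new row; it does not answer how to
attach that id without rewriting data files.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class IndexEntry:
    file: str    # data file holding the current version of the row
    pos: int     # position of the row inside that file
    row_id: int  # lineage row id carried across updates


class PrimaryKeyIndex:
    """Hypothetical in-memory stand-in for the Flink-maintained index."""

    def __init__(self) -> None:
        self._entries: Dict[str, IndexEntry] = {}
        self._next_row_id = 0

    def upsert(
        self, key: str, file: str, pos: int
    ) -> Tuple[Optional[IndexEntry], IndexEntry]:
        """Record a new row for `key`; return (replaced entry, new entry)."""
        old = self._entries.get(key)
        if old is None:
            # First time we see this key: assign a fresh row id.
            row_id = self._next_row_id
            self._next_row_id += 1
        else:
            # Update: the new row inherits the row id of the deleted row.
            row_id = old.row_id
        new = IndexEntry(file, pos, row_id)
        self._entries[key] = new
        return old, new
```

The replaced entry tells the converter which position to mark as
deleted, while the new entry carries the row id that would have to be
attached to the inserted row.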

On concurrent writes: For the time being, I think we should not allow
concurrent maintenance tasks, including equality delete conversion.
Concurrent writes are still supported, as long as they go to the
staging branch.
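
As a rough illustration of this rule, assuming a single non-blocking
lock shared by all maintenance tasks: `MaintenanceLock` and
`StagingBranch` below are hypothetical names, not the Flink
maintenance framework's actual classes. Maintenance tasks are skipped
while another one holds the lock, whereas writers commit to the
staging branch without touching it.

```python
import threading
from typing import Callable, List


class MaintenanceLock:
    """Hypothetical stand-in for the Flink-maintained maintenance lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def run_exclusive(self, task: Callable[[], None]) -> bool:
        """Run `task` only if no other maintenance task holds the lock.

        Returns False (and skips the task) when the lock is already taken.
        """
        if not self._lock.acquire(blocking=False):
            return False
        try:
            task()
            return True
        finally:
            self._lock.release()


class StagingBranch:
    """Writers append here without taking the maintenance lock."""

    def __init__(self) -> None:
        self.commits: List[str] = []
        self._guard = threading.Lock()  # only protects the list itself

    def commit(self, snapshot: str) -> None:
        with self._guard:
            self.commits.append(snapshot)
```

With this shape, a conversion attempt that arrives while compaction is
running simply returns `False` and can be retried on the next trigger.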

@Manu:

+1 to Peter's response. The primary key index is bounded and
independent of the number of accumulated equality deletes, so memory
doesn't blow up, as long as we have sufficient resources to load the
index. We definitely cannot rely on the full index to fit into memory.
Fortunately, Flink is already prepared for this; it supports spilling
to disk via its RocksDB state backend.
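
To make the "bounded" argument concrete, here is a small sketch that
models the index as a plain dict (an assumption for illustration; the
real index would live in Flink state). The function name
`apply_upserts` is hypothetical.

```python
from typing import Dict, Iterable, Tuple

Location = Tuple[str, int]  # (data file, row position)


def apply_upserts(
    index: Dict[str, Location], batch: Iterable[Tuple[str, str, int]]
) -> int:
    """Apply (key, file, pos) upserts to the index.

    Returns how many existing rows were replaced, i.e. how many
    equality deletes this batch would have produced.
    """
    replaced = 0
    for key, file, pos in batch:
        if key in index:
            replaced += 1
        index[key] = (file, pos)  # one entry per key, overwritten in place
    return replaced


index: Dict[str, Location] = {}
total_replaced = 0
for commit in range(100):
    batch = [(f"k{i}", f"f{commit}.parquet", i) for i in range(10)]
    total_replaced += apply_upserts(index, batch)
```

After 100 commits over the same 10 keys, 990 rows have been replaced,
but the index still holds 10 entries: its size follows the number of
live keys, not the number of deletes processed.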

Cheers,
Max

On Fri, Mar 20, 2026 at 11:07 AM Péter Váry  wrote:
>
> Equality delete resolution could be made significantly more efficient by 
> using an index (e.g., backed by RocksDB) to store the current mapping from 
> primary keys to (file, position). While the memory footprint would not be 
> small, it would be bounded and independent of the number of accumulated 
> equality deletes. In addition, even a blocking compaction should block for a 
> shorter period than the typical interval at which table compactions are 
> scheduled.
>
> Manu Zhang wrote (on Thu, Mar 19, 2026, 15:57):
>>
>> Thanks Max for the proposal. One question here.
>> When the convert task cannot finish in time (e.g. blocked by compactions),
>> and equality deletes accumulate on the staging branch, will we have the same 
>> issue as loading too many equality deletes and blowing up memory?
>>
>> Regards,
>> Manu
>>
>> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry  
>> wrote:
>>>
>>> Thanks, Max, for continuing to push this forward.
>>>
>>> The proposal feels like a step in the right direction, but I would like to 
>>> see a clearer view of the end goal. As it stands, equality deletes remain 
>>> in the spec because the changes are committed to an intermediate branch. 
>>> Since the long‑term objective is to remove equality deletes from the 
>>> specification altogether, we should be clear about the final solution that 
>>> achieves this.
>>>
>>> Flink writes will also continue to have the limitation that row lineage is 
>>> not maintained correctly. This is unchanged from the current situation, but 
>>> I think it’s important to explicitly call this out, or ideally, explore 
>>> whether there’s a way to address it.
>>>
>>> In addition, concurrent writes and compactions would require updating the 
>>> primary key index, which could be expensive.
>>>
>>> That said, I don’t see a clearly better alternative at the moment, and 
>>> overall this seems like a reasonable way forward.
>>>
>>> Thanks again for continuing to drive the proposal.
>>> Peter
>>>
>>>
>>> On Wed, Mar 18, 2026, 16:47 Maximilian Michels  wrote:

>>>> Hi,
>>>>
>>>> I'd like to discuss resolving equality deletes in the Flink write
>>>> path, which will get us one step closer to removing equality deletes
>>>> from the spec.
>>>>
>>>> ## tl;dr
>>>>
>>>> We're planning to add an equality delete to deletion vector (DV)
>>>> conversion to Flink. Equality deletes may remain as an internal
>>>> intermediary format.
>>>>
>>>> ## Background
>>>>
>>>> For deletes, Flink currently produces equality delete files.
>>>>
>>>> Equality deletes are used to support deletes in the write path, which
>>>> is a requirement for many use cases like CDC [3]. They are cheap for
>>>> the writer; it only notes down the to-be-deleted values of the
>>>> identifier fields inside so-called delete fi

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-03-20 Thread Péter Váry
Equality delete resolution could be made significantly more efficient by
using an index (e.g., backed by RocksDB) to store the current mapping from
primary keys to (file, position). While the memory footprint would not be
small, it would be bounded and independent of the number of accumulated
equality deletes. In addition, even a blocking compaction should block for
a shorter period than the typical interval at which table compactions are
scheduled.
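
A minimal sketch of what such an index lookup could look like,
assuming the index maps each primary key to a (file, position) pair.
Deletion vectors are modelled here as plain sets of positions per data
file rather than the bitmap format the spec uses, and the function
name `resolve_equality_deletes` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Location = Tuple[str, int]  # (data file, row position)


def resolve_equality_deletes(
    index: Dict[str, Location], deleted_keys: Iterable[str]
) -> Dict[str, Set[int]]:
    """Turn key-based deletes into per-file sets of deleted positions."""
    dvs: Dict[str, Set[int]] = defaultdict(set)
    for key in deleted_keys:
        # One point lookup per delete; no table scan is needed.
        location = index.pop(key, None)
        if location is None:
            continue  # key is not live in the table, nothing to delete
        file, pos = location
        dvs[file].add(pos)
    return dict(dvs)
```

For example, with an index of `{"a": ("f1", 0), "b": ("f1", 1),
"c": ("f2", 0)}`, resolving deletes for keys `a`, `c`, and an unknown
key marks position 0 in both `f1` and `f2` and leaves only `b` in the
index.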

Manu Zhang wrote (on Thu, Mar 19, 2026, 15:57):

> Thanks Max for the proposal. One question here.
> When the convert task cannot finish in time (e.g. blocked by
> compactions), and equality deletes accumulate on the staging branch, will
> we have the same issue as loading too many equality deletes and blowing up
> memory?
>
> Regards,
> Manu
>
> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry 
> wrote:
>
>> Thanks, Max, for continuing to push this forward.
>>
>> The proposal feels like a step in the right direction, but I would like
>> to see a clearer view of the end goal. As it stands, equality deletes
>> remain in the spec because the changes are committed to an intermediate
>> branch. Since the long‑term objective is to remove equality deletes from
>> the specification altogether, we should be clear about the final solution
>> that achieves this.
>>
>> Flink writes will also continue to have the limitation that row lineage
>> is not maintained correctly. This is unchanged from the current situation,
>> but I think it’s important to explicitly call this out, or ideally, explore
>> whether there’s a way to address it.
>>
>> In addition, concurrent writes and compactions would require updating the
>> primary key index, which could be expensive.
>>
>> That said, I don’t see a clearly better alternative at the moment, and
>> overall this seems like a reasonable way forward.
>>
>> Thanks again for continuing to drive the proposal.
>> Peter
>>
>>
>> On Wed, Mar 18, 2026, 16:47 Maximilian Michels  wrote:
>>
>>> Hi,
>>>
>>> I'd like to discuss resolving equality deletes in the Flink write
>>> path, which will get us one step closer to removing equality deletes
>>> from the spec.
>>>
>>> ## tl;dr
>>>
>>> We're planning to add an equality delete to deletion vector (DV)
>>> conversion to Flink. Equality deletes may remain as an internal
>>> intermediary format.
>>>
>>> ## Background
>>>
>>> For deletes, Flink currently produces equality delete files.
>>>
>>> Equality deletes are used to support deletes in the write path, which
>>> is a requirement for many use cases like CDC [3]. They are cheap for
>>> the writer; it only notes down the to-be-deleted values of the
>>> identifier fields inside so-called delete files, and leaves it up to
>>> the readers to match the values to the corresponding rows. The heavy
>>> lifting has to be done by the readers, which potentially need to scan
>>> the entire table to resolve equality deletes.
>>>
>>> Therefore, equality deletes have been criticized. There are
>>> discussions around deprecating / removing them [1].
>>>
>>> ## Resolving Equality Deletes
>>>
>>> Steven, Peter, and a few other contributors came up with a proposal to
>>> convert equality deletes into DV [2]. The original solution was quite
>>> complex, mainly due to the conflict handling between streaming writes,
>>> table maintenance, and equality delete resolution. The proposal is
>>> also blocked on index support in the Iceberg spec [5].
>>>
>>> We may need to simplify further to make some progress. The old table
>>> specs are going to be around for some time, even after we have a new
>>> spec with index support. Users have been asking for a solution to this
>>> issue for quite some time [3].
>>>
>>> The following is a modification of the original design document which
>>> adapts the ideas described under "use lock to avoid conflicts" [2].
>>>
>>> ## Proposed Solution
>>>
>>> The idea is to add the equality delete to deletion vector (DV)
>>> conversion as a Flink table maintenance task. After recent changes, we
>>> can now run the writer and the maintenance in the same Flink job and
>>> use a Flink-maintained lock to avoid conflicts between the maintenance
>>> tasks.
>>>
>>> 1. Instead of writing directly to the target branch, the writer
>>> commits data files + equality deletes to a staging branch.
>>> 2. The new "EqualityDeleteResolver" maintenance task reads from the
>>> staging branch and converts the equality deletes to DVs using a
>>> Flink-maintained primary key index, then commits data files + DVs to
>>> the target branch.
>>> 3. The existing Flink maintenance framework's lock mechanism ensures
>>> mutual exclusion between the convert task and table compaction to
>>> avoid conflicts.
>>>
>>> After conversion, the target branch contains only data files and DVs,
>>> no equality deletes.
>>>
>>> ## Limitations
>>>
>>> - Readers will only see new data once the conversion is complete.
>>> This is partially mitigated by the fact that snapshots with equal

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-03-19 Thread Manu Zhang
Thanks Max for the proposal. One question here.
When the convert task cannot finish in time (e.g. blocked by compactions),
and equality deletes accumulate on the staging branch, will we have the
same issue as loading too many equality deletes and blowing up memory?

Regards,
Manu

On Thu, Mar 19, 2026 at 2:45 PM Péter Váry 
wrote:

> Thanks, Max, for continuing to push this forward.
>
> The proposal feels like a step in the right direction, but I would like to
> see a clearer view of the end goal. As it stands, equality deletes remain
> in the spec because the changes are committed to an intermediate branch.
> Since the long‑term objective is to remove equality deletes from the
> specification altogether, we should be clear about the final solution that
> achieves this.
>
> Flink writes will also continue to have the limitation that row lineage is
> not maintained correctly. This is unchanged from the current situation, but
> I think it’s important to explicitly call this out, or ideally, explore
> whether there’s a way to address it.
>
> In addition, concurrent writes and compactions would require updating the
> primary key index, which could be expensive.
>
> That said, I don’t see a clearly better alternative at the moment, and
> overall this seems like a reasonable way forward.
>
> Thanks again for continuing to drive the proposal.
> Peter
>
>
> On Wed, Mar 18, 2026, 16:47 Maximilian Michels  wrote:
>
>> Hi,
>>
>> I'd like to discuss resolving equality deletes in the Flink write
>> path, which will get us one step closer to removing equality deletes
>> from the spec.
>>
>> ## tl;dr
>>
>> We're planning to add an equality delete to deletion vector (DV)
>> conversion to Flink. Equality deletes may remain as an internal
>> intermediary format.
>>
>> ## Background
>>
>> For deletes, Flink currently produces equality delete files.
>>
>> Equality deletes are used to support deletes in the write path, which
>> is a requirement for many use cases like CDC [3]. They are cheap for
>> the writer; it only notes down the to-be-deleted values of the
>> identifier fields inside so-called delete files, and leaves it up to
>> the readers to match the values to the corresponding rows. The heavy
>> lifting has to be done by the readers, which potentially need to scan
>> the entire table to resolve equality deletes.
>>
>> Therefore, equality deletes have been criticized. There are
>> discussions around deprecating / removing them [1].
>>
>> ## Resolving Equality Deletes
>>
>> Steven, Peter, and a few other contributors came up with a proposal to
>> convert equality deletes into DV [2]. The original solution was quite
>> complex, mainly due to the conflict handling between streaming writes,
>> table maintenance, and equality delete resolution. The proposal is
>> also blocked on index support in the Iceberg spec [5].
>>
>> We may need to simplify further to make some progress. The old table
>> specs are going to be around for some time, even after we have a new
>> spec with index support. Users have been asking for a solution to this
>> issue for quite some time [3].
>>
>> The following is a modification of the original design document which
>> adapts the ideas described under "use lock to avoid conflicts" [2].
>>
>> ## Proposed Solution
>>
>> The idea is to add the equality delete to deletion vector (DV)
>> conversion as a Flink table maintenance task. After recent changes, we
>> can now run the writer and the maintenance in the same Flink job and
>> use a Flink-maintained lock to avoid conflicts between the maintenance
>> tasks.
>>
>> 1. Instead of writing directly to the target branch, the writer
>> commits data files + equality deletes to a staging branch.
>> 2. The new "EqualityDeleteResolver" maintenance task reads from the
>> staging branch and converts the equality deletes to DVs using a
>> Flink-maintained primary key index, then commits data files + DVs to
>> the target branch.
>> 3. The existing Flink maintenance framework's lock mechanism ensures
>> mutual exclusion between the convert task and table compaction to
>> avoid conflicts.
>>
>> After conversion, the target branch contains only data files and DVs,
>> no equality deletes.
>>
>> ## Limitations
>>
>> - Readers will only see new data once the conversion is complete.
>> This is partially mitigated by the fact that snapshots with equality
>> deletes cannot be read properly with Flink today [4].
>> - The Flink-maintained index needs to be built initially, which
>> requires reading the entire table. We will use Flink's state backend,
>> which, apart from heap-based storage, also supports spilling to disk
>> via the RocksDB state backend.
>>
>> ## Wrapping up
>>
>> This solution may not be perfect because of the above limitations, but
>> it provides a viable path to free users of the burden of equality
>> deletes, which cannot be read efficiently by most engines today.
>> Eventually, the Flink-maintained index can be replaced by an Iceberg
>> index, which will

Re: [DISCUSS] Flink: Equality delete to DV conversion

2026-03-18 Thread Péter Váry
Thanks, Max, for continuing to push this forward.

The proposal feels like a step in the right direction, but I would like to
see a clearer view of the end goal. As it stands, equality deletes remain
in the spec because the changes are committed to an intermediate branch.
Since the long‑term objective is to remove equality deletes from the
specification altogether, we should be clear about the final solution that
achieves this.

Flink writes will also continue to have the limitation that row lineage is
not maintained correctly. This is unchanged from the current situation, but
I think it’s important to explicitly call this out, or ideally, explore
whether there’s a way to address it.

In addition, concurrent writes and compactions would require updating the
primary key index, which could be expensive.

That said, I don’t see a clearly better alternative at the moment, and
overall this seems like a reasonable way forward.

Thanks again for continuing to drive the proposal.
Peter


On Wed, Mar 18, 2026, 16:47 Maximilian Michels  wrote:

> Hi,
>
> I'd like to discuss resolving equality deletes in the Flink write
> path, which will get us one step closer to removing equality deletes
> from the spec.
>
> ## tl;dr
>
> We're planning to add an equality delete to deletion vector (DV)
> conversion to Flink. Equality deletes may remain as an internal
> intermediary format.
>
> ## Background
>
> For deletes, Flink currently produces equality delete files.
>
> Equality deletes are used to support deletes in the write path, which
> is a requirement for many use cases like CDC [3]. They are cheap for
> the writer; it only notes down the to-be-deleted values of the
> identifier fields inside so-called delete files, and leaves it up to
> the readers to match the values to the corresponding rows. The heavy
> lifting has to be done by the readers, which potentially need to scan
> the entire table to resolve equality deletes.
>
> Therefore, equality deletes have been criticized. There are
> discussions around deprecating / removing them [1].
>
> ## Resolving Equality Deletes
>
> Steven, Peter, and a few other contributors came up with a proposal to
> convert equality deletes into DV [2]. The original solution was quite
> complex, mainly due to the conflict handling between streaming writes,
> table maintenance, and equality delete resolution. The proposal is
> also blocked on index support in the Iceberg spec [5].
>
> We may need to simplify further to make some progress. The old table
> specs are going to be around for some time, even after we have a new
> spec with index support. Users have been asking for a solution to this
> issue for quite some time [3].
>
> The following is a modification of the original design document which
> adapts the ideas described under "use lock to avoid conflicts" [2].
>
> ## Proposed Solution
>
> The idea is to add the equality delete to deletion vector (DV)
> conversion as a Flink table maintenance task. After recent changes, we
> can now run the writer and the maintenance in the same Flink job and
> use a Flink-maintained lock to avoid conflicts between the maintenance
> tasks.
>
> 1. Instead of writing directly to the target branch, the writer
> commits data files + equality deletes to a staging branch.
> 2. The new "EqualityDeleteResolver" maintenance task reads from the
> staging branch and converts the equality deletes to DVs using a
> Flink-maintained primary key index, then commits data files + DVs to
> the target branch.
> 3. The existing Flink maintenance framework's lock mechanism ensures
> mutual exclusion between the convert task and table compaction to
> avoid conflicts.
>
> After conversion, the target branch contains only data files and DVs,
> no equality deletes.
>
> ## Limitations
>
> - Readers will only see new data once the conversion is complete.
> This is partially mitigated by the fact that snapshots with equality
> deletes cannot be read properly with Flink today [4].
> - The Flink-maintained index needs to be built initially, which
> requires reading the entire table. We will use Flink's state backend,
> which, apart from heap-based storage, also supports spilling to disk
> via the RocksDB state backend.
>
> ## Wrapping up
>
> This solution may not be perfect because of the above limitations, but
> it provides a viable path to free users of the burden of equality
> deletes, which cannot be read efficiently by most engines today.
> Eventually, the Flink-maintained index can be replaced by an Iceberg
> index, which will allow for the index to be shared across engines.
>
> What does the community think?
>
> Thanks,
> Max
>
> [1] Deprecate equality deletes:
> https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
> [2] Design doc:
> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/edit
> [3] Upserts use case:
> https://lists.apache.org/thread/rt7dmg7l78xpzc9w3lwn090yzqq4fyyw
> [4] Handling upserts downstream:
> https://lists.apach