RE: Re: [DISCUSS] Flink: Equality delete to DV conversion

Qian Su Tue, 12 May 2026 15:57:01 -0700

Thanks for driving this proposal forward! Our organization is still largely 
running on V2 tables. Because migrating our entire footprint to V3 will take 
some time, having a V2-compatible conversion path would be incredibly valuable 
for our use cases. I would be very interested in contributing to the V2 path to 
help support this. Please let me know where it makes the most sense for me to 
jump in—whether collaborating directly on the current PR or picking it up as a 
follow-up once the initial V3 foundation lands.


Qian

On 2026/04/17 13:58:28 Maximilian Michels wrote:
> Hi Steven,
> 
> Thanks for chiming in! Yes, the current solution requires V3 tables.
> I'll probably have to make that clearer in the PR description.
> 
> In principle, it wouldn't be hard to make it work with V2 either,
> obviously with the tradeoffs that come along with it in terms of
> storage / lookup efficiency.
> 
> Cheers,
> Max
> 
> On Thu, Apr 16, 2026 at 5:30 PM Steven Wu <[email protected]> wrote:
> >
> > Sorry for the delayed response. I agree that this proposal could get us one 
> > step closer to removing equality deletes. It can also verify whether the 
> > index solution can scale for streaming CDC/Upsert writes.
> >
> > > For this PR, we opted to use deletion vectors over regular delete files 
> > > due to their efficiency in terms of space and lookup.
> >
> > I didn't quite get this from the PR description. Does this mean it is 
> > limited to V3 tables?
> >
> > Thanks for working on this, Max!
> >
> > On Thu, Apr 16, 2026 at 8:03 AM Maximilian Michels <[email protected]> wrote:
> >>
> >> Thanks everyone for the discussion and support.
> >>
> >> I've opened a PR which implements what we discussed here:
> >> https://github.com/apache/iceberg/pull/15996
> >>
> >> Just to cap:
> >>
> >> The EqualityDeleteConverter (EDC) will be running in-line with the
> >> writer, which produces the equality deletes to a staging branch. We
> >> monitor the staging branch for new commits, build a sharded primary
> >> key index in Flink state (backed by RocksDB for large tables), resolve
> >> equality deletes against the index, and commit back the resulting data
> >> files + DVs to the main branch.
> >>
> >> The PR is split into several commits, which we can break out into
> >> separate PRs for easier review. There are some limitations and
> >> follow-ups listed in the PR. The biggest gaps are preserving row
> >> lineage and lifecycle management of the staging branch. In a
> >> follow-up, we will also add integration with the Flink IcebergSink. I
> >> tried to keep the scope limited to the EDC maintenance task for the
> >> first PR.
> >>
> >> Cheers,
> >> Max
> >>
> >> On Thu, Apr 2, 2026 at 9:43 AM Márton Balassi <[email protected]> wrote:
> >> >
> >> > Thanks for raising this, Max and for the feedback Peter and Manu.
> >> >
> >> > I am supportive of this proposal, especially with the clearly defined 
> >> > vision of eventually completely removing the need for equality deletes.
> >> >
> >> > Lifting the reliance on equality deletes in the Flink write path would 
> >> > be a significant improvement, both in terms of read efficiency (by 
> >> > moving towards delete vectors) and in terms of new capabilities, as it 
> >> > would make writing upserts from Flink a viable path to explore going 
> >> > forward.
> >> >
> >> > Cheers,
> >> > Marton
> >> >
> >> > On 2026/03/20 11:15:26 Maximilian Michels wrote:
> >> > > Thanks Peter and Manu for the feedback.
> >> > >
> >> > > @Peter:
> >> > >
> >> > > Good point on the end goal. The end goal should be to completely
> >> > > remove equality deletes.
> >> > >
> >> > > While the staging branch, which contains the equality deletes, is an
> >> > > internal implementation detail of the Flink writer, it will still be
> >> > > accessible via the Iceberg reader API. For the transition period, I
> >> > > think this has several advantages:
> >> > > 1. We don't need to fundamentally change the write logic of existing 
> >> > > writers.
> >> > > 2. We still allow for the data to be inspected before converting it
> >> > > and merging it to the main branch. This is also helpful for
> >> > > troubleshooting.
> >> > >
> >> > > The staging branch solution is a first step towards removing equality 
> >> > > deletes.
> >> > >
> >> > > In V4, we could already deprecate equality deletes. Once the spec
> >> > > includes indices, we can move the index into Iceberg, which should
> >> > > make it easier to develop an in-place resolution of equality deletes
> >> > > supporting multiple writers and conflict resolution. Admittedly, we
> >> > > haven't fully figured out the best in-place approach. I think it is a
> >> > > good idea to take it one step at a time.
> >> > >
> >> > > On row lineage: If we want to preserve the row id of updated rows, we
> >> > > will have to store the row id in the primary key index. Theoretically,
> >> > > we should be able to then add it to the corresponding new row. The
> >> > > question is how to do that efficiently, such that we don't have to
> >> > > rewrite any data files. We would need some way to map the row id of
> >> > > the newly inserted row to the row id of the deleted row. Do we already
> >> > > have such functionality in Iceberg?
> >> > >
> >> > > On concurrent writes: For the time being, I think we should not allow
> >> > > concurrent maintenance tasks, including equality delete conversion.
> >> > > Concurrent writes are still supported, as long as they go to the
> >> > > staging branch.
> >> > >
> >> > > @Manu:
> >> > >
> >> > > +1 to Peter's response. The primary key index is bounded and
> >> > > independent of the number of accumulated equality deletes, so memory
> >> > > doesn't blow up, as long as we have sufficient resources to load the
> >> > > index. We definitely cannot rely on the full index to fit into memory.
> >> > > Fortunately, Flink is already prepared for this; it supports spilling
> >> > > to disk via its RocksDB state backend.
> >> > >
> >> > > Cheers,
> >> > > Max
> >> > >
> >> > > On Fri, Mar 20, 2026 at 11:07 AM Péter Váry <[email protected]> wrote:
> >> > > >
> >> > > > Equality delete resolution could be made significantly more 
> >> > > > efficient by using an index (e.g., backed by RocksDB) to store the 
> >> > > > current mapping from primary keys to (file, position). While the 
> >> > > > memory footprint would not be small, it would be bounded and 
> >> > > > independent of the number of accumulated equality deletes. In 
> >> > > > addition, even a blocking compaction should block for a shorter 
> >> > > > period than the typical interval at which table compactions are 
> >> > > > scheduled.
> >> > > >
> >> > > > Manu Zhang <[email protected]> ezt írta (időpont: 2026. márc. 19., Cs, 
> >> > > > 15:57):
> >> > > >>
> >> > > >> Thanks Max for the proposal. One question here.
> >> > > >> When the convert task can not finish in time (e.g. blocked by 
> >> > > >> compactions), and equality deletes accumulate on the staging 
> >> > > >> branch, will we have the same issue as loading too many equality 
> >> > > >> deletes and blowing up memory?
> >> > > >>
> >> > > >> Regards,
> >> > > >> Manu
> >> > > >>
> >> > > >> On Thu, Mar 19, 2026 at 2:45 PM Péter Váry <[email protected]> wrote:
> >> > > >>>
> >> > > >>> Thanks, Max, for continuing to push this forward.
> >> > > >>>
> >> > > >>> The proposal feels like a step in the right direction, but I would 
> >> > > >>> like to see a clearer view of the end goal. As it stands, equality 
> >> > > >>> deletes remain in the spec because the changes are committed to an 
> >> > > >>> intermediate branch. Since the long‑term objective is to remove 
> >> > > >>> equality deletes from the specification altogether, we should be 
> >> > > >>> clear about the final solution that achieves this.
> >> > > >>>
> >> > > >>> Flink writes will also continue to have the limitation that row 
> >> > > >>> lineage is not maintained correctly. This is unchanged from the 
> >> > > >>> current situation, but I think it’s important to explicitly call 
> >> > > >>> this out, or ideally, explore whether there’s a way to address it.
> >> > > >>>
> >> > > >>> In addition, concurrent writes and compactions would require 
> >> > > >>> updating the primary key index, which could be expensive.
> >> > > >>>
> >> > > >>> That said, I don’t see a clearly better alternative at the moment, 
> >> > > >>> and overall this seems like a reasonable way forward.
> >> > > >>>
> >> > > >>> Thanks again for continuing to drive the proposal.
> >> > > >>> Peter
> >> > > >>>
> >> > > >>>
> >> > > >>> On Wed, Mar 18, 2026, 16:47 Maximilian Michels <[email protected]> 
> >> > > >>> wrote:
> >> > > >>>>
> >> > > >>>> Hi,
> >> > > >>>>
> >> > > >>>> I'd like to discuss resolving equality deletes in the Flink write
> >> > > >>>> path, which will get us one step closer to removing equality 
> >> > > >>>> deletes
> >> > > >>>> from the spec.
> >> > > >>>>
> >> > > >>>> ## tl;dr
> >> > > >>>>
> >> > > >>>> We're planning to add an equality delete to deletion vector (DV)
> >> > > >>>> conversion to Flink. Equality deletes may remain as an internal
> >> > > >>>> intermediary format.
> >> > > >>>>
> >> > > >>>> ## Background
> >> > > >>>>
> >> > > >>>> For deletes, Flink currently produces equality delete files.
> >> > > >>>>
> >> > > >>>> Equality deletes are used to support deletes in the write path, 
> >> > > >>>> which
> >> > > >>>> is a requirement for many use cases like CDC [3]. They are cheap 
> >> > > >>>> for
> >> > > >>>> the writer; it only notes down the to-be-deleted values of the
> >> > > >>>> identifier fields inside so-called delete files, and leaves it up 
> >> > > >>>> for
> >> > > >>>> the readers to match the values to the corresponding rows. The 
> >> > > >>>> heavy
> >> > > >>>> lifting has to be done by the readers, which potentially need to 
> >> > > >>>> scan
> >> > > >>>> the entire table to resolve equality deletes.
> >> > > >>>>
> >> > > >>>> Therefore, equality deletes have been criticized. There are
> >> > > >>>> discussions around deprecating / removing them [1].
> >> > > >>>>
> >> > > >>>> ## Resolving Equality Deletes
> >> > > >>>>
> >> > > >>>> Steven, Peter, and a few other contributors came up with a 
> >> > > >>>> proposal to
> >> > > >>>> convert equality deletes into DV [2]. The original solution was 
> >> > > >>>> quite
> >> > > >>>> complex, mainly due to the conflict handling between streaming 
> >> > > >>>> writes,
> >> > > >>>> table maintenance, and equality delete resolution. The proposal is
> >> > > >>>> also blocked on index support in the Iceberg spec [5].
> >> > > >>>>
> >> > > >>>> We may need to simplify further to make some progress. The old 
> >> > > >>>> table
> >> > > >>>> specs are going to be around for some time, even after we have a 
> >> > > >>>> new
> >> > > >>>> spec with index support. Users have been asking for a solution to 
> >> > > >>>> this
> >> > > >>>> issue for quite some time [3].
> >> > > >>>>
> >> > > >>>> The following is a modification of the original design document 
> >> > > >>>> which
> >> > > >>>> adapts the ideas described under "use lock to avoid conflicts" 
> >> > > >>>> [2].
> >> > > >>>>
> >> > > >>>> ## Proposed Solution
> >> > > >>>>
> >> > > >>>> The idea is to add the equality delete to deletion vector (DV)
> >> > > >>>> conversion as a Flink table maintenance task. After recent 
> >> > > >>>> changes, we
> >> > > >>>> can now run the writer and the maintenance in the same Flink job 
> >> > > >>>> and
> >> > > >>>> use a Flink-maintained lock to avoid conflicts between the 
> >> > > >>>> maintenance
> >> > > >>>> tasks.
> >> > > >>>>
> >> > > >>>> 1. Instead of writing directly to the target branch, the writer
> >> > > >>>> commits data files + equality deletes to a staging branch.
> >> > > >>>> 2. The new "EqualityDeleteResolver" maintenance task reads from 
> >> > > >>>> the
> >> > > >>>> staging branch and converts the equality deletes to DVs using a
> >> > > >>>> Flink-maintained primary key index, then commits data files + DVs 
> >> > > >>>> to
> >> > > >>>> the target branch.
> >> > > >>>> 3. The existing Flink maintenance framework's lock mechanism 
> >> > > >>>> ensures
> >> > > >>>> mutual exclusion between the convert task and table compaction to
> >> > > >>>> avoid conflicts.
> >> > > >>>>
> >> > > >>>> After conversion, the target branch contains only data files and 
> >> > > >>>> DVs,
> >> > > >>>> no equality deletes.
> >> > > >>>>
> >> > > >>>> ## Limitations
> >> > > >>>>
> >> > > >>>> - Readers will only see new data until the conversion is complete.
> >> > > >>>> This is partially mitigated by the fact that snapshots with 
> >> > > >>>> equality
> >> > > >>>> deletes cannot be read properly with Flink today [4].
> >> > > >>>> - The Flink-maintained index needs to be built initially which
> >> > > >>>> requires reading the entire table. We will use Flink's state 
> >> > > >>>> backend
> >> > > >>>> which apart from heap-based storage, also supports spilling to 
> >> > > >>>> disk
> >> > > >>>> via the RocksDB state backend.
> >> > > >>>>
> >> > > >>>> ## Wrapping up
> >> > > >>>>
> >> > > >>>> This solution may not be perfect because of the above 
> >> > > >>>> limitations, but
> >> > > >>>> it provides a viable path to free users of the burden of equality
> >> > > >>>> deletes, which cannot be read efficiently by most engines today.
> >> > > >>>> Eventually, the Flink-maintained index can be replaced by an 
> >> > > >>>> Iceberg
> >> > > >>>> index, which will allow for the index to be shared across engines.
> >> > > >>>>
> >> > > >>>> What does the community think?
> >> > > >>>>
> >> > > >>>> Thanks,
> >> > > >>>> Max
> >> > > >>>>
> >> > > >>>> [1] Deprecate equality deletes:
> >> > > >>>> https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
> >> > > >>>> [2] Design doc:
> >> > > >>>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk/edit
> >> > > >>>> [3] Upserts use case:
> >> > > >>>> https://lists.apache.org/thread/rt7dmg7l78xpzc9w3lwn090yzqq4fyyw
> >> > > >>>> [4] Handling upserts downstream:
> >> > > >>>> https://lists.apache.org/thread/bx1ntfr45c0g9rh643yw7w8znv6wtrno
> >> > > >>>> [5] V4 Index: 
> >> > > >>>> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j
> >> > >
>

RE: Re: [DISCUSS] Flink: Equality delete to DV conversion

Reply via email to