Flink CDC supports reading binlog data from databases such as MySQL and
PostgreSQL, and writing it to Iceberg, Hudi, and Paimon:
https://github.com/apache/flink-cdc/pulls?q=iceberg
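
For reference, a minimal DataStream sketch of consuming a MySQL binlog with
the Flink CDC connector (a sketch, not a drop-in job: the host, database,
table, and credentials below are placeholders, and the package names assume
Flink CDC 3.x):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.cdc.connectors.mysql.source.MySqlSource;
    import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class MySqlBinlogJob {
        public static void main(String[] args) throws Exception {
            MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("mysql.example.com")     // placeholder host
                .port(3306)
                .databaseList("inventory")         // placeholder database
                .tableList("inventory.orders")     // placeholder table
                .username("flink_cdc")             // placeholder credentials
                .password("******")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            // Binlog offsets are committed as part of Flink checkpoints.
            env.enableCheckpointing(10_000L);
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "MySQL CDC Source")
               .print();                           // replace with an Iceberg/Hudi/Paimon sink
            env.execute("mysql-binlog-cdc");
        }
    }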

On Wed, Jan 21, 2026 at 3:27 PM Steven Wu <[email protected]> wrote:

> Lu,
>
> You are correct about the design doc for Flink writing position deletes
> only. The original design had high complexity, so we were considering
> alternatives with a narrower scope. But there hasn't been any progress,
> and there is no timeline yet.
>
> IMHO, your setup is a good practice today. Ryan wrote a series of blog
> posts about the pattern:
> https://tabular.medium.com/hello-world-of-cdc-e6f06ddbfcc0
>
> Some people use the current Flink Iceberg sink for CDC ingestion, but it
> produces equality deletes, which require aggressive compaction and add
> operational burden. Also, not all engines can read equality deletes.
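>
> Roughly, the upsert mode of that sink looks like the sketch below (a rough
> sketch: the table location and key column are placeholders; each upsert is
> written as an equality delete plus an insert):
>
>     import java.util.Arrays;
>     import org.apache.flink.streaming.api.datastream.DataStream;
>     import org.apache.flink.table.data.RowData;
>     import org.apache.iceberg.flink.TableLoader;
>     import org.apache.iceberg.flink.sink.FlinkSink;
>
>     class UpsertSinkExample {
>         // changelog: CDC rows (inserts/updates/deletes) as RowData
>         static void writeUpserts(DataStream<RowData> changelog) {
>             TableLoader tableLoader = TableLoader.fromHadoopTable(
>                 "hdfs://nn:8020/warehouse/db/current");    // placeholder location
>             FlinkSink.forRowData(changelog)
>                 .tableLoader(tableLoader)
>                 .upsert(true)                              // emit equality deletes per key
>                 .equalityFieldColumns(Arrays.asList("id")) // placeholder key column
>                 .append();
>         }
>     }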
>
> Thanks,
> Steven
>
> On Tue, Jan 20, 2026 at 8:44 PM Gang Wu <[email protected]> wrote:
>
>> Hi Lu,
>>
>> Nice to hear from you here in the Iceberg community :)
>>
>> We have built an internal service that streams upserts as position
>> deletes, which happens to have a lot in common with [1] and [2]. I believe
>> this is a viable approach to achieving second-level freshness.
>>
>> [1]
>> https://docs.google.com/document/d/1Jz4Fjt-6jRmwqbgHX_u0ohuyTB9ytDzfslS7lYraIjk
>> [2] https://www.mooncake.dev/whitepaper
>>
>> Best,
>> Gang
>>
>> On Wed, Jan 21, 2026 at 11:05 AM Lu Niu <[email protected]> wrote:
>>
>>> Hi Iceberg community,
>>>
>>> What are the current best practices for streaming upserts into an
>>> Iceberg table?
>>>
>>> Today, we have the following setup in production to support CDC:
>>>
>>> 1. A Flink job that continuously appends CDC events into an append-only
>>> `raw` table
>>> 2. A periodically scheduled Spark job that upserts the `raw` table into
>>> the `current` table (see the sketch below)
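>>>
>>> Step 2 is roughly the following (a sketch: the table and column names are
>>> placeholders, and it assumes the Iceberg Spark SQL extensions are enabled
>>> so that MERGE INTO is available):
>>>
>>>     import org.apache.spark.sql.SparkSession;
>>>
>>>     public class RawToCurrentMerge {
>>>         public static void main(String[] args) {
>>>             SparkSession spark = SparkSession.builder()
>>>                 .appName("raw-to-current-merge")
>>>                 .getOrCreate();
>>>
>>>             // Keep only the latest event per key, then merge into `current`.
>>>             spark.sql(
>>>                 "MERGE INTO db.current t " +
>>>                 "USING (SELECT id, op, payload, event_ts FROM (" +
>>>                 "  SELECT *, row_number() OVER " +
>>>                 "    (PARTITION BY id ORDER BY event_ts DESC) AS rn" +
>>>                 "  FROM db.raw) deduped WHERE rn = 1) s " +
>>>                 "ON t.id = s.id " +
>>>                 "WHEN MATCHED AND s.op = 'D' THEN DELETE " +
>>>                 "WHEN MATCHED THEN UPDATE SET " +
>>>                 "  t.payload = s.payload, t.event_ts = s.event_ts " +
>>>                 "WHEN NOT MATCHED AND s.op <> 'D' THEN " +
>>>                 "  INSERT (id, payload, event_ts) " +
>>>                 "  VALUES (s.id, s.payload, s.event_ts)");
>>>         }
>>>     }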
>>>
>>> We are exploring whether it’s feasible to stream upserts directly into
>>> an Iceberg table from Flink. This could simplify our architecture and
>>> potentially further reduce our data SLA. We’ve experimented with this
>>> approach before, but ran into reader-side performance issues due to the
>>> accumulation of equality deletes over time.
>>>
>>> From what I can gather, streaming upserts still seems to be an open
>>> design area:
>>>
>>> 1. (Please correct me if I’m wrong—this summary is partly based on
>>> ChatGPT 5.1.) The book “Apache Iceberg: The Definitive Guide” suggests the
>>> two-table pattern we’re currently using in production.
>>> 2. These threads,
>>> https://lists.apache.org/thread/gjjr30txq318qp6pff3x5fx1jmdnr6fv and
>>> https://lists.apache.org/thread/xdkzllzt4p3tvcd3ft4t7jsvyvztr41j, discuss
>>> the idea of outputting only positional deletes (no equality deletes) by
>>> introducing an index. However, this appears to still be under discussion
>>> and may be targeted for v4, with no concrete timeline yet.
>>> 3. This thread,
>>> https://lists.apache.org/thread/6fhpjszsfxd8p0vfzc3k5vw7zmcyv2mq, discusses
>>> deprecating equality deletes, but I haven't seen a clearly defined
>>> alternative come out of that discussion yet.
>>>
>>> Given all of the above, I’d really appreciate guidance from the
>>> community on:
>>>
>>> 1. Recommended patterns for streaming upserts from Flink into Iceberg
>>> today (it's good to know the long-term possibilities as well, but my
>>> focus is what's possible in the near term).
>>> 2. Practical experiences or lessons learned from teams running streaming
>>> upserts in production
>>>
>>> Thanks in advance for any insights and corrections.
>>>
>>> Best
>>> Lu
>>>
>>
