Hi Russell,

As discussed offline, this would be very hard to implement with the current
Flink CDC write strategies. I think this is true for every streaming
writer.

To track the previous version of a row, the streaming writer would need to
scan the table for every incoming record to find that record's previous
version. This could work if the data were stored in a way that supports
fast lookups on the primary key, like an LSM tree (see Paimon [1]);
otherwise it would be prohibitively costly and infeasible at higher loads.
So adding a new storage strategy could be one solution.
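
To make the cost concrete, here is a rough Java sketch of the per-record
lookup. All names and types here are hypothetical stand-ins, not Iceberg
or Flink APIs, and the in-memory map stands in for a key-indexed store
such as an LSM tree:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the per-record lookup a streaming writer
// would need in order to populate lineage fields on upserts.
public class LineageLookupSketch {

  // Hypothetical lineage fields: a stable row id plus a version marker.
  record Lineage(long rowId, long version) {}

  // Stand-in for a key-indexed store (e.g. LSM-tree backed).
  private final Map<String, Lineage> indexByPrimaryKey = new HashMap<>();
  private final AtomicLong nextRowId = new AtomicLong(0);

  // For each upsert, find the previous version of the row by primary
  // key, carry its row id forward, and bump the version. A new key
  // gets a fresh row id at version 0.
  Lineage applyUpsert(String primaryKey) {
    Lineage previous = indexByPrimaryKey.get(primaryKey);
    Lineage next = (previous == null)
        ? new Lineage(nextRowId.getAndIncrement(), 0)
        : new Lineage(previous.rowId(), previous.version() + 1);
    indexByPrimaryKey.put(primaryKey, next);
    return next;
  }

  public static void main(String[] args) {
    LineageLookupSketch writer = new LineageLookupSketch();
    System.out.println(writer.applyUpsert("user-1")); // rowId=0, version=0
    System.out.println(writer.applyUpsert("user-1")); // rowId=0, version=1
    System.out.println(writer.applyUpsert("user-2")); // rowId=1, version=0
  }
}

Without such an index, the lookup in applyUpsert amounts to a full table
scan per record, which is what makes the naive approach infeasible.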

Alternatively, we might find a way for compaction to update the lineage
fields. We could provide a way to link the equality deletes to the new rows
that replaced them during write; then, at compaction time, we could update
the lineage fields based on this information.
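
A rough sketch of that compaction-time fix-up, assuming the writer
persists a link from each equality delete to the row it removed. Again,
all types here are hypothetical stand-ins, not Iceberg APIs:

import java.util.Map;

// Hypothetical sketch: compaction inherits lineage from the rows that
// equality deletes removed, using links persisted at write time.
public class CompactionLineageSketch {

  record Row(String key, long rowId, long version, String payload) {}

  // Hypothetical link persisted by the writer: the equality-delete key
  // plus the lineage of the row that the delete removed.
  record DeleteLink(String key, long deletedRowId, long deletedVersion) {}

  // New rows were written with placeholder lineage (rowId < 0 here).
  // During compaction, use the persisted links to inherit the deleted
  // row's id and bump its version.
  static Row fixLineage(Row newRow, Map<String, DeleteLink> linksByKey) {
    if (newRow.rowId() >= 0) {
      return newRow; // lineage already assigned at write time
    }
    DeleteLink link = linksByKey.get(newRow.key());
    if (link == null) {
      return newRow; // a true insert; assign a fresh id elsewhere
    }
    return new Row(newRow.key(), link.deletedRowId(),
        link.deletedVersion() + 1, newRow.payload());
  }

  public static void main(String[] args) {
    Map<String, DeleteLink> links =
        Map.of("user-1", new DeleteLink("user-1", 42, 3));
    Row pending = new Row("user-1", -1, -1, "updated payload");
    System.out.println(fixLineage(pending, links));
    // Row[key=user-1, rowId=42, version=4, payload=updated payload]
  }
}

The open question would be where to persist these links so that
compaction can read them cheaply.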

Are there any better ideas from Spark streaming that we could adopt?

Thanks,
Peter

[1] - https://paimon.apache.org/docs/0.8/

On Sat, Aug 17, 2024, 01:06 Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Hi Y'all,
>
> We've been working on a new proposal to add Row Lineage to Iceberg in the
> V3 Spec. The general idea is to give every row a unique identifier as well
> as a marker of what version of the row it is. This should let us build a
> variety of features related to CDC, Incremental Processing and Audit
> Logging. If you are interested please check out the linked proposal below.
> This will require compliance from all engines to be really useful, so it's
> important we come to consensus on whether or not this is possible.
>
>
> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>
>
> Thank you for your consideration,
> Russ
>
