Hi folks,

I am reaching out to request your insights on addressing the ambiguous
behavior of generating changelogs in Iceberg.

To provide some context, Iceberg does not enforce row uniqueness even when
configured with identifier fields (a.k.a primary key in the other database
system) during write operations. That means that it is possible to have
multiple rows with the same identifier fields values. For example, let's
consider a table "customer" with columns "id" (int) and "name" (string),
and the identifier field set as "id." It is still possible to write
multiple rows with the same "id" values, as shown below:

(1, 'A')
(1, 'B')
(2, 'X')
(2, 'Y')

The CreateChangelogViewProcedure
<https://github.com/apache/iceberg/blob/master/docs/spark-procedures.md#change-data-capture>
can reconstruct updates based on identifier fields. It works effectively
when there is only one row per identifier value. However, handling multiple
rows with the same identifier values can be challenging. For example, a
`Merge into` or `Update` command can result the following changes:

(1, 'a', DELETE)
(1, 'b', DELETE)
(1, 'c', INSERT)
(1, 'd', INSERT)

Unfortunately, it is impossible to determine whether "c" or "d" updated
"a". For example, both of the following commands are valid even though
there is an identifier column id.

UPDATE table SET data = 'c' WHERE data = 'a';
UPDATE table SET data = 'd' WHERE data = 'b';

Or

UPDATE table SET data = 'd' WHERE data = 'a';
UPDATE table SET data = 'c' WHERE data = 'b';

Due to this uncertainty, we have allowed the procedure to throw an
exception in such cases. A relevant pull request can be found [here
<https://github.com/apache/iceberg/pull/7388>]. I would appreciate any
thoughts or suggestions.

Best,

Yufei

`This is not a contribution`

Reply via email to