Hi folks, I am reaching out to request your insights on addressing the ambiguous behavior of generating changelogs in Iceberg.
To provide some context, Iceberg does not enforce row uniqueness even when configured with identifier fields (a.k.a primary key in the other database system) during write operations. That means that it is possible to have multiple rows with the same identifier fields values. For example, let's consider a table "customer" with columns "id" (int) and "name" (string), and the identifier field set as "id." It is still possible to write multiple rows with the same "id" values, as shown below: (1, 'A') (1, 'B') (2, 'X') (2, 'Y') The CreateChangelogViewProcedure <https://github.com/apache/iceberg/blob/master/docs/spark-procedures.md#change-data-capture> can reconstruct updates based on identifier fields. It works effectively when there is only one row per identifier value. However, handling multiple rows with the same identifier values can be challenging. For example, a `Merge into` or `Update` command can result the following changes: (1, 'a', DELETE) (1, 'b', DELETE) (1, 'c', INSERT) (1, 'd', INSERT) Unfortunately, it is impossible to determine whether "c" or "d" updated "a". For example, both of the following commands are valid even though there is an identifier column id. UPDATE table SET data = 'c' WHERE data = 'a'; UPDATE table SET data = 'd' WHERE data = 'b'; Or UPDATE table SET data = 'd' WHERE data = 'a'; UPDATE table SET data = 'c' WHERE data = 'b'; Due to this uncertainty, we have allowed the procedure to throw an exception in such cases. A relevant pull request can be found [here <https://github.com/apache/iceberg/pull/7388>]. I would appreciate any thoughts or suggestions. Best, Yufei `This is not a contribution`