aokolnychyi commented on issue #3941:
URL: https://github.com/apache/iceberg/issues/3941#issuecomment-1064774980
After thinking more, we may change the assumptions a bit.
### Revisited Assumptions
- Changes are consumed per snapshot
Iceberg does not distinguish the order of records within a snapshot. That's
why Iceberg can only report a summary of what changed in a snapshot.
By default, rows added and removed in the same snapshot (e.g. Flink CDC) are
not shown in the output. In most cases, we just want to see a net delta per
snapshot. If a use case requires all records to be seen, it is still doable.
For example, if a snapshot adds a data file A with two records (pos 0, 1) and a
position delete file D that removes pos 0, we can output something like this.
```
_record_type | _commit_snapshot_id | _commit_order | col1 | col2
-------------------------------------------------------------------
insert, s1, 0, 100, a
insert, s1, 0, 101, a
delete, s1, 1, 100, null
```
That means records within the same snapshot may have different
`_commit_order`. Seems a little bit odd but kind of represents what happens in
the Iceberg table. Audit use cases may need this.
- Output only delete and insert record types by default
Iceberg does not have a notion of an update, which means constructing
pre/post images will require some computation. In a lot of cases, this won't be
needed (e.g. refresh of a materialized view, syncing changes with an external
system). I think we should prefer a more efficient algorithm if possible.
There seems to be a way to build pre/post update images (see the comment
above) but it will require joins and equality deletes will be the trickiest.
For instance, a single equality delete may match a number of records and we
have to report all of them.
- Table must have identity columns defined
Whenever we apply a delta log to a target table, most CDC use cases rely on
a primary key or a set of identity columns. I think it is reasonable to assume
Iceberg tables should have identity columns defined to support generation of
CDC records.
- Delete records may only include values for identity columns
It is not required to output the entire deleted record if other columns are
not stored in equality delete files. If values for other columns are not
persisted, this would require an expensive join to reconstruct the entire row.
### Open questions
- How to deal with changing identity columns?
- How to deal with cases when an equality delete is issued using a set of
columns that is different from identity columns?
### MVP
- Implement a Spark action that would output delete/update record types per
snapshot.
### Future
- Option for outputting records added and removed in the same snapshot.
- Option for pre/post update images.
- Different ways to consume these changes (e.g. cdc metadata table).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]