[GitHub] [iceberg] aokolnychyi commented on issue #3941: [Feature Request] Support for change data capture

GitBox Thu, 10 Mar 2022 21:02:51 -0800


aokolnychyi commented on issue #3941:
URL: https://github.com/apache/iceberg/issues/3941#issuecomment-1064774980



   After thinking more, we may change the assumptions a bit.
   
   ### Revisited Assumptions
   
   - Changes are consumed per snapshot
   
   Iceberg does not distinguish the order of records within a snapshot. That's 
why Iceberg can only report a summary of what changed in a snapshot. 
   
   By default, rows added and removed in the same snapshot (e.g. Flink CDC) are 
not shown in the output. In most cases, we just want to see a net delta per 
snapshot. If a use case requires all records to be seen, it is still doable. 
For example, if a snapshot adds a data file A with two records (pos 0, 1) and a 
position delete file D that removes pos 0, we can output something like this.
   
   ```
    _record_type | _commit_snapshot_id | _commit_order | col1 | col2
   -------------------------------------------------------------------
   insert, s1, 0, 100, a
   insert, s1, 0, 101, a
   delete, s1, 1, 100, null
   ```
   
   That means records within the same snapshot may have different 
`_commit_order`. Seems a little bit odd but kind of represents what happens in 
the Iceberg table. Audit use cases may need this.
   
   - Output only delete and insert record types by default
   
   Iceberg does not have a notion of an update, which means constructing 
pre/post images will require some computation. In a lot of cases, this won't be 
needed (e.g. refresh of a materialized view, syncing changes with an external 
system). I think we should prefer a more efficient algorithm if possible. 
   
   There seems to be a way to build pre/post update images (see the comment 
above) but it will require joins and equality deletes will be the trickiest. 
For instance, a single equality delete may match a number of records and we 
have to report all of them.
   
   - Table must have identity columns defined
   
   Whenever we apply a delta log to a target table, most CDC use cases rely on 
a primary key or a set of identity columns. I think it is reasonable to assume 
Iceberg tables should have identity columns defined to support generation of 
CDC records.
   
   - Delete records may only include values for identity columns
   
   It is not required to output the entire deleted record if other columns are 
not stored in equality delete files. If values for other columns are not 
persisted, this would require an expensive join to reconstruct the entire row.
   
   ### Open questions
   
   - How to deal with changing identity columns?
   - How to deal with cases when an equality delete is issued using a set of 
columns that is different from identity columns?
   
   ### MVP
   
   - Implement a Spark action that would output delete/update record types per 
snapshot.
   
   ### Future
   
   - Option for outputting records added and removed in the same snapshot.
   - Option for pre/post update images.
   - Different ways to consume these changes (e.g. cdc metadata table).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] aokolnychyi commented on issue #3941: [Feature Request] Support for change data capture

Reply via email to