aokolnychyi edited a comment on issue #3941:
URL: https://github.com/apache/iceberg/issues/3941#issuecomment-1064596725


   > So having an Update record type would help to segregate it from Insert and 
Delete records in a more convenient way.
   
   The problem right now is that there is no update in Iceberg. There are only 
inserts and deletes. An update is represented as a delete followed by an 
insert. That being said, there may be a way to construct an update record given 
delete and insert records. For example, we can shuffle the delete/insert 
records so that all record types for the same identity columns are next to 
each other.
   
   ```
   delete, s1, 100, null
   insert, s1, 100, 1
   ```
   
   In this case, we can construct a post-update image. I am not sure how we can 
construct a pre-update image without joining the delete record with the target 
table (that's going to be expensive). We can do that more or less efficiently 
for position deletes and copy-on-write, but equality deletes may include only 
values for identity columns. We would have to scan a lot of data to reconstruct 
a pre-update image for equality deletes.
   
   This pairing approach would only work if the identity columns are not 
modified, because a changed key would put the delete and the insert in 
different groups.
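   
   The pairing idea can be sketched roughly as follows. This is only an 
in-memory illustration, not an Iceberg API: the function `pair_into_updates` 
and the field names `op`, `snapshot`, `id`, and `data` are assumptions made up 
for the example, and a real implementation would shuffle the records by 
identity columns instead of grouping them in a dict. It produces only the 
post-update image.

```python
from collections import defaultdict


def pair_into_updates(records, identity_cols):
    """Fold each delete+insert pair that shares a key into one update record.

    Hypothetical sketch: records are dicts with an "op" field ("delete" or
    "insert"), a "snapshot" field, and data columns. Only the post-update
    image is produced; the delete record is not used as a pre-update image.
    """
    # Co-locate all records of the same snapshot with the same identity values.
    groups = defaultdict(list)
    for rec in records:
        key = (rec["snapshot"],) + tuple(rec[c] for c in identity_cols)
        groups[key].append(rec)

    out = []
    for recs in groups.values():
        deletes = [r for r in recs if r["op"] == "delete"]
        inserts = [r for r in recs if r["op"] == "insert"]
        if len(deletes) == 1 and len(inserts) == 1:
            # Exactly one delete and one insert: report the insert as an update.
            out.append({**inserts[0], "op": "update"})
        else:
            # Plain inserts, plain deletes, or ambiguous groups pass through.
            out.extend(recs)
    return out


changes = [
    {"op": "delete", "snapshot": "s1", "id": 100, "data": None},
    {"op": "insert", "snapshot": "s1", "id": 100, "data": 1},
    {"op": "insert", "snapshot": "s1", "id": 101, "data": 7},
]
result = pair_into_updates(changes, identity_cols=("id",))
```

   Groups that do not contain exactly one delete and one insert are passed 
through unchanged, so a modified identity column shows up as a separate 
delete and insert.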
   
   > Also, I noticed that in example 2 above, CDC records are generated for 
unchanged records (id=106). For Copy-On-Write tables, would this be the 
behaviour of CDC?
   
   This one is a little bit easier. To start with, we can report unchanged rows, 
since that is exactly what happens in the table. Whenever we rewrite a file in 
copy-on-write, we delete all rows from that file and add new records, some of 
which are simply copied over. In the future, we can use the above idea and 
co-locate entries for the same identity columns. Then we can remove pairs where 
a record is deleted and added without any changes. This won't require any joins 
with the target table, so it won't be that expensive. Maybe the action can have 
an option to perform this deduplication. That way, rows that were copied over 
in copy-on-write won't be part of the output.
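   
   A minimal sketch of that deduplication, under the same assumptions as 
before: the function `drop_carried_over_rows`, the field names, and the 
sample values are hypothetical, and matching is done on the whole row in 
memory rather than by a shuffle on identity columns.

```python
from collections import Counter


def drop_carried_over_rows(records):
    """Remove delete/insert pairs whose rows are identical.

    Hypothetical sketch: a row that copy-on-write merely copied into the new
    file appears once as a delete and once as an insert with the same values,
    so the two records cancel out.
    """
    def row_of(rec):
        # Everything except the operation type identifies the row contents.
        return tuple(sorted((k, v) for k, v in rec.items() if k != "op"))

    # Net count per row: +1 for each insert, -1 for each delete.
    net = Counter()
    for rec in records:
        net[row_of(rec)] += 1 if rec["op"] == "insert" else -1

    out = []
    for rec in records:
        row = row_of(rec)
        if rec["op"] == "insert" and net[row] > 0:
            out.append(rec)  # insert with no matching delete
            net[row] -= 1
        elif rec["op"] == "delete" and net[row] < 0:
            out.append(rec)  # delete with no matching insert
            net[row] += 1
    return out


# A copy-on-write rewrite of one file: id=105 changes, id=106 is copied over.
changes = [
    {"op": "delete", "snapshot": "s2", "id": 105, "data": "a"},
    {"op": "delete", "snapshot": "s2", "id": 106, "data": "b"},
    {"op": "insert", "snapshot": "s2", "id": 105, "data": "a2"},
    {"op": "insert", "snapshot": "s2", "id": 106, "data": "b"},
]
remaining = drop_carried_over_rows(changes)
```

   Only the delete and insert for id=105 remain here. The pass looks only at 
the changelog records themselves, so it needs no join with the target table.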
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


