aokolnychyi edited a comment on issue #3941: URL: https://github.com/apache/iceberg/issues/3941#issuecomment-1064596725
> So having an Update record type would help to segregate it from Insert and Delete records in more convenient way.

The problem right now is that there is no update in Iceberg. There are only inserts and deletes. An update is represented as a delete followed by an insert.

That being said, there may be a way to construct an update record given delete and insert records. For example, we can shuffle the delete/insert records so that all record types for the same identity columns are next to each other.

```
delete, s1, 100, null
insert, s1, 100, 1
```

In this case, we can construct a post-update image. I am not sure how we can construct a pre-update image without joining the delete record with the target table (that's going to be expensive). We can do that more or less efficiently for position deletes and copy-on-write, but equality deletes may include only values for identity columns. We will have to scan a lot of data to reconstruct a pre-update image for equality deletes. This would only work if the identity columns are not modified.

> Also, I noticed that in example 2 above, CDC records are generated for unchanged records (id=106). For Copy-On-Write tables, would this be the behaviour of CDC?

This one is a little bit easier. To start with, we can report unchanged rows, as that is exactly what happens in the table. Whenever we rewrite a file in copy-on-write, we delete all rows from that file and add new records, some of which are simply copied over.

In the future, we can use the above idea and co-locate entries for the same identity columns. Then we can remove pairs where a record is deleted and added without any changes. This won't require any joins with the target table, so it won't be that expensive. Maybe the action can have an option to perform this deduplication. That way, rows that were copied over in copy-on-write won't be part of the output.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
For additional commands, e-mail: [email protected]
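
As a rough illustration of the co-location idea in the comment above, here is a minimal Python sketch. It is not Iceberg code: the `(op, key, row)` record layout, the function name `collapse_changes`, and the sample values are all assumptions made for the example. It groups change records by their identity-column values, drops delete/insert pairs whose rows are identical (rows merely copied over by copy-on-write), and pairs a remaining delete with a remaining insert as an update. A delete that carries only identity values (as an equality delete may) has no pre-update image, so that field stays `None`.

```python
from collections import defaultdict


def collapse_changes(records):
    """Collapse raw delete/insert change records into insert/delete/update.

    Hypothetical layout: each record is (op, key, row), where op is
    "delete" or "insert", key is a tuple of identity-column values, and
    row is the full row tuple, or None when a delete carries only
    identity values.
    """
    # "Shuffle" step: co-locate all records that share identity values.
    by_key = defaultdict(lambda: {"delete": [], "insert": []})
    for op, key, row in records:
        by_key[key][op].append(row)

    out = []
    for key, ops in by_key.items():
        deletes, inserts = list(ops["delete"]), list(ops["insert"])

        # Drop pairs where the same full row was deleted and re-added,
        # e.g. rows carried over when copy-on-write rewrites a file.
        for row in list(deletes):
            if row is not None and row in inserts:
                deletes.remove(row)
                inserts.remove(row)

        # A remaining delete plus a remaining insert becomes one update.
        while deletes and inserts:
            before, after = deletes.pop(0), inserts.pop(0)
            out.append(("update", key, before, after))

        out.extend(("delete", key, row, None) for row in deletes)
        out.extend(("insert", key, None, row) for row in inserts)
    return out


records = [
    ("delete", ("s1",), None),             # identity values only
    ("insert", ("s1",), ("s1", 100, 1)),
    ("delete", ("s2",), ("s2", 106, 5)),   # carried over unchanged
    ("insert", ("s2",), ("s2", 106, 5)),
    ("insert", ("s3",), ("s3", 7, 2)),
]
result = collapse_changes(records)
```

With these sample records, `s1` comes out as an update with a post-update image but no pre-update image, `s2` disappears from the output, and `s3` stays a plain insert. No join against the target table is needed, but the pairing is only meaningful as long as the identity columns themselves are not modified.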
