openinx commented on issue #360: URL: https://github.com/apache/iceberg/issues/360#issuecomment-645207673
> What are the use cases for a format that allows a dynamic column set for every row?

Take the GDPR use case as an example: a user has many properties, and there are a few deletions that each need to delete a different combination of properties, such as `(a,b)`, `(a,b,c)`, `(a,b)`.

> In that case, it's easier if the schema of the delete file is a data file with just the primary key columns. So if I had a users table, I might have a delete file with a single column, user_id with a field ID that matches the data files.

Thinking about the design again, assume the CDC case where we have a table with two columns `(id, a)`, and `id` is the primary key. The CDC events arrive in the following order within a single Iceberg transaction:

```
t0, INSERT(1,2);
t1, DELETE(1,2);
t2, INSERT(1,3);
```

As you said, we will produce two different files for this transaction: one is the data file and the other is the delete differential file. So the data file will have:

```
(1,2);
(1,3);
```

and the delete file will only have the primary key column (as you said, if I understand correctly):

```
(1);
```

When we do the merge on read, I guess both `(1,2)` and `(1,3)` will be deleted by `(1)`, while we should actually return the row `(1,3)`, because `DELETE(1,2)` should only remove the record written by `INSERT(1,2)`.

So I'm thinking that for equality deletes, we will need to keep all columns in the delete differential files so that we can get rid of this issue (say, INSERT and DELETE of the same record a few times in a single Iceberg transaction).
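To make the scenario concrete, here is a minimal Python sketch of the merge-on-read behavior described above. It is a toy model, not Iceberg's reader: the function `merge_on_read` and the dict-based row layout are hypothetical, and it assumes every equality delete in the transaction applies to every data row in the same transaction, regardless of event order.

```python
def merge_on_read(data_rows, delete_rows, delete_columns):
    """Toy equality-delete merge (hypothetical helper, not Iceberg's API).

    A data row is dropped when its values on `delete_columns` equal
    those of any row in the delete file.
    """
    deleted = {tuple(d[c] for c in delete_columns) for d in delete_rows}
    return [
        row for row in data_rows
        if tuple(row[c] for c in delete_columns) not in deleted
    ]


# Data file produced by INSERT(1,2) and INSERT(1,3).
data_file = [{"id": 1, "a": 2}, {"id": 1, "a": 3}]

# Delete file that keeps only the primary key column: DELETE(1,2) becomes (1).
key_only_result = merge_on_read(data_file, [{"id": 1}], ["id"])

# Delete file that keeps all columns: DELETE(1,2) stays (1,2).
full_row_result = merge_on_read(data_file, [{"id": 1, "a": 2}], ["id", "a"])

print(key_only_result)
print(full_row_result)
```

Under these assumptions, the key-only delete file removes both rows, while the full-row delete file leaves `(1,3)` in place, which is the result the CDC sequence expects.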
