openinx commented on issue #360: URL: https://github.com/apache/iceberg/issues/360#issuecomment-651531457
> I have been thinking of an upsert case, where we get an entire replacement row but not the previous row values.

An `upsert` without the previous row values depends on a primary key specification; otherwise, `UPSERT(1,2)` has no way to know which row it should change. A primary key also introduces other questions:

1. Should the primary key columns include the partition column and bucket column? If not, consider a table with columns (a, b), where a is the primary key and b is the partition key: for an `UPSERT(1,2)`, the old row could be in any partition of the iceberg table, so we would need to broadcast the event to every partition, which wastes resources. If yes, it limits usage: for example, we may adjust the bucket policy based on the fields to be JOINed between two tables to avoid a massive data shuffle, and then adjusting the bucket policy would also require adjusting the primary key to include the new bucket columns.

2. How do we ensure uniqueness in an iceberg table? Within a given bucket, two different versions of the same row may appear in two different data files, so we would need a costly `JOIN` between data files (previously, we only needed to JOIN data files against delete files). Within a given iceberg table, if the primary key does not include the partition column, two rows with the same primary key may appear in two different partitions, so primary-key deduplication is a problem too.

In my opinion, CDC events from an RDBMS should always provide both the old values and the new values (as `row-level` binlog does). It would be good to handle that case correctly first. Other CDC events, such as an `UPSERT` carrying only the primary key, need more consideration.

> Using a unique ID for each CDC event could mitigate the problem, but duplicate events from at-least-once processing would still incorrectly delete data rows. How would you avoid this problem?

Spark streaming only guarantees `at-least-once` semantics.
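To make the full-values case concrete, here is a minimal sketch (plain Python with hypothetical `CdcEvent` and `partition_of` names, not Iceberg's actual API) of why carrying the old values matters: the event can be translated into a delete + insert pair locally, because the old row values tell the writer exactly which partition holds the stale copy, so no lookup or broadcast is needed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdcEvent:
    """A row-level CDC event, e.g. decoded from a MySQL row-level binlog."""
    op: str                 # "INSERT", "UPDATE", or "DELETE"
    before: Optional[dict]  # full old row values (None for INSERT)
    after: Optional[dict]   # full new row values (None for DELETE)

def partition_of(row: dict) -> str:
    # Hypothetical partition spec: table (a, b) partitioned by column "b".
    return f"b={row['b']}"

def to_delta_ops(event: CdcEvent) -> list:
    """Translate one CDC event into (partition, action, row) triples.

    Because `before` carries the full old row, the DELETE is routed
    directly to the partition holding the stale copy; without `before`,
    the writer would have to look up or broadcast to find it.
    """
    ops = []
    if event.before is not None:
        ops.append((partition_of(event.before), "DELETE", event.before))
    if event.after is not None:
        ops.append((partition_of(event.after), "INSERT", event.after))
    return ops

# An UPDATE that moves a row across partitions (b: 2 -> 3):
update = CdcEvent(op="UPDATE", before={"a": 1, "b": 2}, after={"a": 1, "b": 3})
print(to_delta_ops(update))
# -> [('b=2', 'DELETE', {'a': 1, 'b': 2}), ('b=3', 'INSERT', {'a': 1, 'b': 3})]
```

Note how the same event without `before` (an upsert of only the new row) would leave the `b=2` partition holding a duplicate of primary key `a=1`, which is exactly the uniqueness problem described above.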
It seems hard to keep the data in the sink table consistent unless we provide an identifiable global timestamp to indicate the before/after order. Let me think more about this. Flink provides `exactly-once` semantics, so that case seems easier.

> but I'm not sure it would enable a source to replay exactly the events that were received -- after all, we're still separating inserts and deletes into separate files so we can't easily reconstruct an upsert event.

I think the mixed equality-deletion/positional-deletion approach you described makes it hard to reconstruct the correct event order. The pure positional-deletion approach described in this [document](https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc/edit#heading=h.ljqc7bxmc6ej) could restore the original order.

> I don't think that it makes a case for dynamic columns.

Yes, I agree that we don't need to care about dynamic columns now. Real CDC events always provide all column values for a row, so please ignore the proposal about dynamic-columns encoding.

---

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
