openinx commented on issue #360:
URL: https://github.com/apache/iceberg/issues/360#issuecomment-651531457


   > I have been thinking of an upsert case, where we get an entire replacement 
row but not the previous row values. 
   
   An `upsert` without the previous row values depends on a primary key specification; otherwise an `UPSERT(1,2)` doesn't know which row it should change. The primary key also introduces other questions: 
   1. Should the primary key columns include the partition column and bucket column? If not, then for a table with columns (a, b), where a is the primary key and b is the partition key, the old row for an `UPSERT(1,2)` could be in any partition of the Iceberg table, so we would need to broadcast the delete to every partition, which seems to waste resources (see the sketch after this list). If so, it limits usage: for example, we may adjust the bucket policy based on the fields to be JOINed between two tables to avoid a massive data shuffle, and then adjusting the bucket policy would also force us to adjust the primary key to include the new bucket columns.
   2. How do we ensure uniqueness in an Iceberg table? Within a given bucket, two different versions of the same row may appear in two different data files, so we would need a costly `JOIN` between data files (previously, we only needed to JOIN data files against delete files). Within a given Iceberg table, if the primary key doesn't include the partition column, then two rows with the same primary key may appear in two different partitions, so primary-key deduplication is a problem too.
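   
   To make the broadcast problem in point 1 concrete, here is a toy sketch (plain Java with hypothetical names, not Iceberg code) of a table with columns (a, b), partitioned by b, with a as the primary key. Because the key alone does not identify the partition, deleting the old row for an `UPSERT(1,2)` has to visit every partition:
   
   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   
   // Toy model only: columns (a, b), partitioned by b, primary key a.
   class ToyPartitionedTable {
       // partition value b -> rows (a, b) stored in that partition
       private final Map<Integer, List<int[]>> partitions = new HashMap<>();
   
       void insert(int a, int b) {
           partitions.computeIfAbsent(b, k -> new ArrayList<>()).add(new int[] {a, b});
       }
   
       // UPSERT without the previous row values: since the primary key (a)
       // does not include the partition column (b), the old row could live in
       // any partition, so the equality delete is broadcast to all of them.
       void upsert(int a, int b) {
           for (List<int[]> rows : partitions.values()) {  // scan every partition
               rows.removeIf(row -> row[0] == a);          // equality delete by key
           }
           insert(a, b);
       }
   }
   ```
   
   If the primary key included the partition column, the `upsert` could go straight to a single partition; that is exactly the trade-off point 1 describes.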
   
   In my opinion, CDC events from an RDBMS should always provide both the old values and the new values (as `row-level` binlog does). It would be good to handle that case correctly first. The other kinds of CDC events, such as an `UPSERT` identified only by its primary key, seem to need more consideration.
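   
   As a sketch of why having both images makes the translation mechanical, assume a hypothetical row-level change event shape (roughly what MySQL row-based binlog or Debezium provides); an update then becomes a precise delete of the old row plus an insert of the new one:
   
   ```java
   import java.util.List;
   import java.util.Map;
   
   // Hypothetical shape of a row-level CDC event: both images are present.
   record RowChangeEvent(Map<String, Object> before, Map<String, Object> after) {}
   
   class CdcTranslator {
       // With the old values in hand, the delete is precise -- no primary key
       // lookup or broadcast is needed to find the row being replaced.
       static List<String> toTableOps(RowChangeEvent event) {
           return List.of(
               "DELETE " + event.before(),
               "INSERT " + event.after());
       }
   }
   ```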
   
   > Using a unique ID for each CDC event could mitigate the problem, but 
duplicate events from at-least-once processing would still incorrectly delete 
data rows. How would you avoid this problem?
   
   Spark streaming only guarantees `at-least-once` semantics, so it seems hard to keep the sink table consistent unless we provide an identifying global timestamp to indicate the before/after order. Let me think more about this. Flink provides `exactly-once` semantics, so it seems easier there.
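   
   One way to tolerate at-least-once replay, assuming the source can attach such a globally ordered timestamp (or version) to every event, is last-write-wins: a replayed duplicate carries the same timestamp it had originally, so reapplying it is a no-op. A minimal sketch of the idea (not a proposal for the Iceberg format):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Sketch: an idempotent sink keyed by primary key, assuming every CDC event
   // carries a globally ordered timestamp. A replayed duplicate's timestamp is
   // not newer than the stored one, so it changes nothing.
   class LastWriteWinsSink {
       record Versioned(long ts, Object value) {}   // value == null is a tombstone
   
       private final Map<Object, Versioned> state = new HashMap<>();
   
       void apply(Object key, Object valueOrNull, long ts) {
           Versioned current = state.get(key);
           if (current != null && current.ts() >= ts) {
               return;                              // duplicate or stale: ignore
           }
           state.put(key, new Versioned(ts, valueOrNull));
       }
   }
   ```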
   
   > but I'm not sure it would enable a source to replay exactly the events 
that were received -- after all, we're still separating inserts and deletes 
into separate files so we can't easily reconstruct an upsert event.
   
   I think the mixed equality-deletion/positional-deletion approach you described makes it hard to reconstruct the correct event order. The pure positional-deletion approach described in this [document](https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc/edit#heading=h.ljqc7bxmc6ej) could restore the original order.
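   
   To see why pure positional deletion can preserve the order, here is a toy model (hypothetical names; the real layout is in the linked document): each insert is appended to a data file at a known offset, and a delete names that exact position, so every delete points unambiguously at one earlier row and the write order of the two logs replays the original event sequence:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Toy model of pure positional deletion: inserts go into an append-only data
   // file; a delete records the exact position of the row it removes. Both logs
   // are written in arrival order, so scanning them together replays the
   // original insert/delete (and hence upsert) sequence exactly.
   class PositionalLog {
       private final List<String> dataFile = new ArrayList<>();    // row i at index i
       private final List<Integer> deleteFile = new ArrayList<>(); // deleted positions
   
       int insert(String row) {
           dataFile.add(row);
           return dataFile.size() - 1;   // the append offset is the row's position
       }
   
       void delete(int position) {
           deleteFile.add(position);     // no equality match: one exact row
       }
   }
   ```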
   
   >  I don't think that it makes a case for dynamic columns.
   
   Yes, I agree that we don't need to care about dynamic columns for now. Real CDC events always provide all of the column values for a row. Please ignore the proposal about dynamic-column encoding.

