openinx commented on issue #360:
URL: https://github.com/apache/iceberg/issues/360#issuecomment-653532308


   @rdblue, our team also had another discussion about the use cases and 
solutions. For the use cases, we have three: 
   
   * Case.1: the table has a primary key (the unique ID, as you said), and all 
CDC events are INSERT/UPDATE/DELETE operations (the UPDATE event carries both 
the old values and the new values), such as the classic MySQL binlog events 
(assuming that `t` has `(id, data)` columns and `id` is the primary key): 
   
   ```sql
   INSERT INTO t SET @1=1 @2='apple';
   DELETE FROM t WHERE @1=1 @2='apple';
   UPDATE t WHERE @1=1 @2='apple' SET @1=1 @2='pear';
   ```
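
To make the event shapes concrete, here is a small sketch (plain Python, not Iceberg code; the `CdcEvent` type is just an illustration) of how those binlog rows could be normalized, with the UPDATE carrying both the old and the new image:

```python
# Illustrative model of Case.1 event shapes: INSERT carries new values,
# DELETE carries old values, and UPDATE carries both.
from dataclasses import dataclass
from typing import Optional, Tuple

Row = Tuple[int, str]  # (id, data), where id is the primary key

@dataclass
class CdcEvent:
    op: str             # 'INSERT' | 'UPDATE' | 'DELETE'
    old: Optional[Row]  # present for UPDATE/DELETE
    new: Optional[Row]  # present for INSERT/UPDATE

# The three binlog events above, normalized:
events = [
    CdcEvent('INSERT', old=None,         new=(1, 'apple')),
    CdcEvent('DELETE', old=(1, 'apple'), new=None),
    CdcEvent('UPDATE', old=(1, 'apple'), new=(1, 'pear')),
]

# Every UPDATE carries both images, so a writer can always derive a
# delete for the old row and an insert for the new row.
assert all(e.old is not None and e.new is not None
           for e in events if e.op == 'UPDATE')
```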
   
   * Case.2: the table has a primary key, and all CDC events are UPSERT/DELETE; 
the UPSERT does not have to provide the old values. For example: 
   
   ```sql
   UPSERT(1, 'apple')
   DELETE(1, 'apple')
   DELETE(1) // a DELETE doesn't have to provide all the column values; the 
primary key is enough.
   ```
   
   * Case.3: the table doesn't have a primary key, and all CDC events are 
INSERT/UPDATE/DELETE (the UPDATE must provide both the old values and the new 
values). Supporting UPSERT without old values is impossible here.
   
   The three cases should cover the CDC requirements from users, and we have 
three goals to meet those requirements: 1) fast ingestion; 2) acceptable 
batch-read performance; 3) an equivalent change stream, so that we can keep 
eventual consistency between the source table and the sink table. 
   
   Now, we have three solutions: 
   
   1. The mixed equality-deletes and pos-deletes solution as you proposed. 
   2. The pure positional-deletes solution I described in the document.
   3. The incremental log files solution. 
   
   In summary, solution #2 and solution #3 resolve the before-after ordering 
for downstream consumers, but both have some issues with batch-read 
performance. Solution #2 needs to check whether there exists a delete row whose 
`<file_id, offset>` is not less than the candidate row's in the INSERT data 
files; it would use a data structure like `HashMap<primaryKey, TreeSet<T>>` 
(maintaining those delete markers) to implement the `JOIN` algorithm. It 
doesn't need any `map<primary_key, pos>` when writing, while solution #1 does, 
but solution #1 is simpler when considering the merge read.
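
For illustration, here is a hedged sketch of that merge-read check (in Python rather than Java; `add_delete` and `is_deleted` are made-up names, and a sorted list per key stands in for the `TreeSet`): a row from an INSERT file is masked when some delete marker for its primary key has a `<file_id, offset>` that is not less than the row's own position.

```python
# Sketch of the solution #2 merge-read JOIN, assuming delete markers are
# keyed by primary key and ordered by (file_id, offset).
from bisect import bisect_left, insort
from collections import defaultdict

# primary_key -> sorted list of (file_id, offset) delete positions,
# playing the role of HashMap<primaryKey, TreeSet<T>>
delete_markers = defaultdict(list)

def add_delete(pk, file_id, offset):
    insort(delete_markers[pk], (file_id, offset))

def is_deleted(pk, file_id, offset):
    # masked if any marker for pk sits at a position >= (file_id, offset)
    positions = delete_markers.get(pk, [])
    return bisect_left(positions, (file_id, offset)) < len(positions)

# Example: a delete marker at (file 1, offset 5) masks earlier rows of pk=1.
add_delete(1, 1, 5)
assert is_deleted(1, 1, 0)       # (1,5) >= (1,0) -> row is masked
assert not is_deleted(1, 1, 6)   # no marker at or after (1,6)
assert not is_deleted(2, 1, 0)   # no marker for pk=2 at all
```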
   
   Solution #3 needs to do a `JOIN` against the newly written INSERTs, while 
solution #1 only needs to `JOIN` between the INSERT and DELETE files. 
   
   For solution #1, the problem is: how do we reconstruct the stream of events 
so that it can be replayed to downstream consumers correctly? Let's take an 
example with these CDC events: 
   
   ```
   t0, INSERT(1,2)
   t1, DELETE(2,2)
   t2, UPSERT(1,3)
   (The first field is primary key)
   ```
   
   It will generate three files: 
   
   * the INSERT file will have: 
   
   ```
   INSERT(1,2)
   INSERT(1,3)
   ```
   
   * the equality-delete file will have: 
   
   ```
   DELETE(2,2)
   ```
   
   * the positional-delete file will have: 
   
   ```
   <data-file=insert-file-id, position=0, primary_key_column=1>
   ```
   
   NOTICE: we add the `primary_key_column=1` to the positional-delete file 
because, if the event is an `UPSERT`, then the implied DELETE(1,2) should also 
mask the data files that have a smaller sequence number. Keeping the primary 
key value here avoids searching the original insert data file to compare 
primary keys when doing the `JOIN` between data files and delete files.
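
As a sketch of the writer side under solution #1 (simplified Python; `write_transaction` and the file names are illustrative, not Iceberg APIs): the writer keeps a `map<primary_key, pos>` for rows inserted in the current transaction, routes a delete for an already-written key to the positional-delete file together with the key value, and routes other deletes to the equality-delete file.

```python
# Illustrative solution #1 writer: one transaction produces an INSERT file,
# an equality-delete file, and a positional-delete file.
def write_transaction(events):
    insert_file = []   # rows, identified by their position (offset)
    eq_deletes = []    # deletes against rows written before this transaction
    pos_deletes = []   # (position, primary_key) pairs, per the NOTICE above
    inserted_pos = {}  # primary_key -> position in insert_file

    for op, row in events:
        pk = row[0]
        if op in ('DELETE', 'UPSERT'):
            if pk in inserted_pos:
                # the replaced row lives in this transaction's INSERT file
                pos_deletes.append((inserted_pos[pk], pk))
            else:
                # an UPSERT for an unseen key masks older files via an
                # equality delete on the key (a simplification here)
                eq_deletes.append(row if op == 'DELETE' else (pk,))
        if op in ('INSERT', 'UPSERT'):
            inserted_pos[pk] = len(insert_file)
            insert_file.append(row)
    return insert_file, eq_deletes, pos_deletes

# The example stream: t0 INSERT(1,2); t1 DELETE(2,2); t2 UPSERT(1,3)
ins, eq, pos = write_transaction([
    ('INSERT', (1, 2)), ('DELETE', (2, 2)), ('UPSERT', (1, 3)),
])
assert ins == [(1, 2), (1, 3)]  # INSERT file
assert eq == [(2, 2)]           # equality-delete file
assert pos == [(0, 1)]          # positional-delete: position=0, primary_key=1
```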
   
   Now let's consider the replay issue again. We can regard the equality-delete 
file as deleting rows that existed before this transaction, so it should be 
replayed first. Then we can restore the correct event order by joining the 
INSERT file with the positional-delete file. In theory, we can maintain 
eventual consistency between the source table and the sink table because we 
guarantee: 
   
   1. Events of the same row will be consumed exactly as the order they are 
written in the source table.
   2. We don’t guarantee the consumption order of different rows.
   3. If an UPDATE event updates one row into another row, then we don’t 
guarantee the before-after order between the two rows. 
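
A minimal replay sketch under these guarantees (again illustrative Python, not the proposed implementation): equality deletes replay first, then the INSERT file is walked in write order, turning a positionally-deleted row followed by a newer row of the same key back into DELETE + INSERT.

```python
# Reconstruct a change stream from the three files of one transaction,
# preserving per-key ordering (guarantee 1) but not cross-key ordering.
def replay(insert_file, eq_deletes, pos_deletes):
    stream = []
    # 1) equality deletes remove rows that existed before this
    #    transaction, so they replay first
    for row in eq_deletes:
        stream.append(('DELETE', row))
    deleted = {pos: pk for pos, pk in pos_deletes}
    pending = {}  # pk -> inserted row whose DELETE has not been emitted yet
    # 2) walk the INSERT file in write order; a positionally deleted row
    #    followed by a newer row of the same key replays as DELETE + INSERT
    for pos, row in enumerate(insert_file):
        pk = row[0]
        if pk in pending:
            stream.append(('DELETE', pending.pop(pk)))
        stream.append(('INSERT', row))
        if pos in deleted:
            pending[pk] = row
    # rows deleted in this transaction with no replacement row
    for row in pending.values():
        stream.append(('DELETE', row))
    return stream

# Files generated from: t0 INSERT(1,2); t1 DELETE(2,2); t2 UPSERT(1,3)
stream = replay([(1, 2), (1, 3)], [(2, 2)], [(0, 1)])
assert stream == [('DELETE', (2, 2)), ('INSERT', (1, 2)),
                  ('DELETE', (1, 2)), ('INSERT', (1, 3))]
```

Note that DELETE(2,2) replays before INSERT(1,2), unlike the source order; that reordering is between different keys and is allowed by guarantee 2.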
   
   So it seems that solution #1 could be a candidate solution for the 
primary-key case#1 and case#2.

