openinx commented on issue #2610:
URL: https://github.com/apache/iceberg/issues/2610#issuecomment-844595929


   Did you run the Flink job with multiple parallel writers writing the same keys to an
unpartitioned table?  Assume there are three operations:
   
   ```
   1.   INSERT key1 value1; 
   2.   DELETE key1 value1 ; 
   3.   INSERT key1 value2 ; 
   ```
   
   If we have a parallelism of 2 writing those rows (without shuffling by primary
key), it's possible that the first writer receives event 2
(`DELETE key1 value1`) and writes it to the iceberg table, while the second
writer receives event 1 and event 3 and writes them.  In that case the
`DELETE key1 value1` won't mask event 1 and event 3, because from Iceberg's
perspective it was written before them, and a delete only masks rows with the
same key that were written before it.  We would then end up with two duplicate
`INSERT key1` rows with different values, `value1` and `value2`.  The
suggested fix is: shuffle by the primary key before writing those rows into
the apache iceberg table, because that guarantees rows with the same key are
written to the iceberg table in the same order they were produced.
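
   The effect of shuffling (or not) can be sketched with a small simulation.
This is not Flink API code; the routing functions and event tuples below are
hypothetical stand-ins for Flink's default round-robin distribution versus a
`keyBy` on the primary key:

   ```python
   # Hypothetical sketch: route a CDC changelog to 2 parallel writers,
   # with and without shuffling by primary key.

   events = [
       ("INSERT", "key1", "value1"),  # event 1
       ("DELETE", "key1", "value1"),  # event 2
       ("INSERT", "key1", "value2"),  # event 3
   ]

   def route_round_robin(events, parallelism=2):
       """No shuffle: events for the same key are split across writers,
       so one writer may get the DELETE while another gets both INSERTs."""
       writers = [[] for _ in range(parallelism)]
       for i, event in enumerate(events):
           writers[i % parallelism].append(event)
       return writers

   def route_by_key(events, parallelism=2):
       """Shuffle by primary key: all events for the same key go to the
       same writer, preserving their produced order."""
       writers = [[] for _ in range(parallelism)]
       for event in events:
           key = event[1]
           writers[hash(key) % parallelism].append(event)
       return writers

   # Round-robin splits key1's events: one writer sees both INSERTs,
   # the other sees only the DELETE, so the ordering is lost.
   print(route_round_robin(events))

   # Key-based routing sends all three events to a single writer,
   # in the original order.
   print(route_by_key(events))
   ```

   In real Flink code the same idea is expressed by keying the stream on the
primary key before the Iceberg sink, so that per-key event order is preserved
end to end.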


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


