openinx commented on issue #2610: URL: https://github.com/apache/iceberg/issues/2610#issuecomment-844595929
Did you write to the unpartitioned table with multiple parallel writers in the Flink job? Assume there are three operations:

```
1. INSERT key1 value1;
2. DELETE key1 value1;
3. INSERT key1 value2;
```

With a parallelism of 2 and no shuffle by primary key, it's possible that the first subtask receives event 2 (`DELETE key1 value1`) and writes it to the iceberg table, while the second subtask receives events 1 and 3 and writes those. The `DELETE key1 value1` then won't mask events 1 and 3, because from the table's perspective it happened before them, and a delete only masks rows with the same key that were written before it. In this case we end up with two duplicate rows for `key1`, with the different values `value1` and `value2`.

The suggested fix for this issue is: shuffle by the primary key before writing the rows into the apache iceberg table, because that ensures rows with the same key are written to iceberg in the same order as they were produced.
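To make the failure mode concrete, here is a minimal sketch (plain Python, not the actual Flink/Iceberg API) that simulates two routing strategies. The assumption is the one described above: each writer applies its own events in arrival order, and a delete can only mask rows the same writer has already written.

```python
def route_round_robin(events, parallelism=2):
    """No key shuffle: events for the same key may land on different writers."""
    slots = [[] for _ in range(parallelism)]
    for i, ev in enumerate(events):
        slots[i % parallelism].append(ev)
    return slots

def route_by_key(events, parallelism=2):
    """Shuffle by primary key: all events for a key go to one writer."""
    slots = [[] for _ in range(parallelism)]
    for ev in events:
        slots[hash(ev[1]) % parallelism].append(ev)
    return slots

def write(slots):
    """Each writer applies its events in order; a DELETE only masks rows
    that were written before it (here: by the same writer)."""
    results = []
    for slot in slots:
        rows = []
        for op, key, value in slot:
            if op == "INSERT":
                rows.append((key, value))
            else:  # DELETE removes earlier rows with the same key/value
                rows = [r for r in rows if r != (key, value)]
        results.extend(rows)
    return sorted(results)

events = [
    ("INSERT", "key1", "value1"),
    ("DELETE", "key1", "value1"),
    ("INSERT", "key1", "value2"),
]

# Round-robin: the DELETE lands on a writer that never saw the first
# INSERT, so it masks nothing and both values survive as duplicates.
print(write(route_round_robin(events)))  # [('key1', 'value1'), ('key1', 'value2')]

# Keyed shuffle: per-key order is preserved, so only value2 survives.
print(write(route_by_key(events)))       # [('key1', 'value2')]
```

In a real Flink job the keyed routing would be done with `keyBy` on the primary key (or the connector's equivalent distribution mode) before the Iceberg sink; the toy `hash`-based routing here just stands in for that.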
