aiss93 commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2552388682
I don't know whether this makes sense given the Spark/Iceberg internals. Consider the following example:

<table>
<tr><th>Table A: huge partition to split</th><th>Table B: partition to replicate</th></tr>
<tr><td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 1 | a |
| 10/10/2024 | 2 | b |
| 10/10/2024 | 3 | c |
| 10/10/2024 | 4 | d |
| 10/10/2024 | 5 | e |
| 10/10/2024 | 6 | f |

</td><td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td></tr>
</table>

After the partition split, we get the following:

<table>
<tr><th>Table A: split partition</th><th>Table B: replicated partition</th></tr>
<tr>
<td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 1 | a |
| 10/10/2024 | 2 | b |

</td><td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td>
</tr>
<tr>
<td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 3 | c |
| 10/10/2024 | 4 | d |

</td><td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td>
</tr>
<tr>
<td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 5 | e |
| 10/10/2024 | 6 | f |

</td><td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td>
</tr>
</table>

If the merge has an `if not matched then insert *` clause then, as you explained above, every replicated copy of the table B partition will be inserted, and we end up with duplicates.

The idea I was suggesting consists of:

- Computing an aggregated boolean that tells whether the `not matched` check is true for all of these replicated partitions.
- If this flag is true, assigning only one replicated partition to execute the insert statement.
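
To make the suggestion concrete, here is a minimal plain-Python sketch of the idea. It is not Spark or Iceberg code and says nothing about how the planner actually works. The join key `(date, id)` and the helper names `split`, `not_matched`, `naive_inserts` and `deduplicated_inserts` are assumptions made up for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, int, str]  # (date, id, value)

# Data from the example: table A is the huge partition, table B is replicated.
table_a: List[Row] = [("10/10/2024", i, v) for i, v in zip(range(1, 7), "abcdef")]
table_b: List[Row] = [("10/10/2024", 7, "x"), ("10/10/2024", 8, "y")]


def split(rows: List[Row], size: int) -> List[List[Row]]:
    """Cut one partition into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def not_matched(source_row: Row, target_chunk: List[Row]) -> bool:
    """True if no target row in this chunk shares the (date, id) key (assumed join key)."""
    key = (source_row[0], source_row[1])
    return all((t[0], t[1]) != key for t in target_chunk)


def naive_inserts(chunks: List[List[Row]], source: List[Row]) -> List[Row]:
    """Every chunk inserts each replicated source row it does not match locally."""
    return [s for chunk in chunks for s in source if not_matched(s, chunk)]


def deduplicated_inserts(chunks: List[List[Row]], source: List[Row]) -> List[Row]:
    """Insert a source row once, and only if it is unmatched in every chunk."""
    inserts = []
    for s in source:
        # Aggregated boolean: the "not matched" check AND-ed over all replicas.
        unmatched_everywhere = all(not_matched(s, chunk) for chunk in chunks)
        if unmatched_everywhere:
            inserts.append(s)  # stands for a single replica doing the insert
    return inserts


chunks = split(table_a, 2)
naive = naive_inserts(chunks, table_b)
dedup = deduplicated_inserts(chunks, table_b)
print(len(naive))  # 6: each of the 2 source rows is inserted by all 3 chunks
print(dedup)       # the 2 source rows, once each
```

Note that the aggregated flag also covers a source row that matches in only one chunk: the naive version would insert it from the other chunks, while the aggregated version does not insert it at all.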
