[GitHub] [iceberg] saimigo commented on issue #7554: How to ensure the data is not repeated when using spark to write to the iceberg table

via GitHub Mon, 08 May 2023 20:10:28 -0700


saimigo commented on issue #7554:
URL: https://github.com/apache/iceberg/issues/7554#issuecomment-1539324853


   > @saimigo The `MERGE INTO` command updates or inserts rows based on a join condition. A common use case is an `UPSERT` operation, e.g.:
   > 
   > ```sql
   > MERGE INTO target t              -- the target table
   > USING source s                   -- your source (can be a table)
   > ON t.id = s.id                   -- condition to find updates for target rows (can be your unique identifiers)
   > WHEN MATCHED THEN UPDATE SET *
   > WHEN NOT MATCHED THEN INSERT *
   > ```
   > 
   > To address a few of your questions:
   > 
   > 1. You can guarantee uniqueness of your data by dropping duplicates in the source's Spark DataFrame or by doing an UPSERT against the target table.
   > 2. You don't have to delete the table when doing an UPSERT.
   > 3. Keep in mind that the Spark engine performs a join based on the condition you define. This means it has to read the target table to find the matching rows, so the target table must exist in the first place.
   
   Is there any example or documentation for the DataFrame API related to merge?
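
One possible pattern for driving the merge above from a DataFrame, offered as a minimal sketch rather than an official Iceberg or Spark API: deduplicate the source DataFrame on its key columns, register it as a temporary view, and run the `MERGE INTO` statement through `spark.sql`. The helper names `build_merge_sql` and `upsert` and the default view name `merge_source` are hypothetical, and the sketch assumes the session already has the Iceberg SQL extensions enabled and that the target table exists.

```python
from typing import Sequence


def build_merge_sql(target: str, source_view: str, keys: Sequence[str]) -> str:
    """Build a MERGE INTO statement that upserts `source_view` into `target`.

    Hypothetical helper: it only assembles the SQL text, matching rows on
    every column listed in `keys`.
    """
    if not keys:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source_view} s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, df, target: str, keys: Sequence[str],
           source_view: str = "merge_source") -> None:
    """Upsert `df` into `target` (assumed to be an existing Iceberg table).

    `spark` is expected to be a SparkSession and `df` a Spark DataFrame.
    """
    # Keep one row per key in the source. Which duplicate survives is
    # arbitrary, so pre-sort or aggregate first if that matters.
    deduped = df.dropDuplicates(list(keys))
    # Expose the DataFrame to SQL under a temporary view name.
    deduped.createOrReplaceTempView(source_view)
    # Rows matching on the keys are updated; all others are inserted.
    spark.sql(build_merge_sql(target, source_view, keys))
```

A call might look like `upsert(spark, source_df, "db.target", ["id"])`. Keeping the SQL construction in its own function makes the statement easy to inspect or log before it is executed.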


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

