Github user xwu0226 commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    A few comments:
    
    1. The major concern is that this solution needs to pull in the whole 
target table and join it with the source DataFrame to determine which rows to 
update and which to insert. I am worried that this join by itself adds a lot of 
performance overhead to the upsert operation. Moreover, while this decision is 
being made, the target table may have changed substantially, which makes the 
insert/update decision stale. 
    
    2. The primary key set provided may not exactly match the unique 
constraints on the target table. Inserts or updates will then fail, because 
some columns that are part of a unique constraint may be outside the provided 
primary key set.
    
    3. The insert is a batch execution of as many statements as there are rows 
to insert. The same goes for updates. We need to pass many statements via JDBC 
to the target database. Would it perform better if the column values were bound 
to host variables in a prepared statement for batch-size rows and executed once 
per batch?
    
    4. Most database systems provide an UPSERT capability, such as `INSERT ... 
ON DUPLICATE KEY UPDATE` in MySQL, `INSERT ... ON CONFLICT ... DO UPDATE SET` 
in PostgreSQL, and the MERGE statement in DB2, Oracle, etc., where the database 
decides whether to insert or update. Maybe we can take advantage of this by 
extending the different JdbcDialect implementations?
    
    PR https://github.com/apache/spark/pull/16692 actually minimizes the issues 
above. Please take a look and compare. 
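
    To make points 3 and 4 concrete, here is a rough sketch in plain JDBC. It 
is not the code from either PR and not an existing Spark API: the class 
`UpsertSketch` and its methods `placeholders`, `postgresUpsert`, `mysqlUpsert` 
and `executeBatched` are hypothetical names chosen for illustration. It assumes 
at least one non-key column and that the key columns are covered by a unique 
constraint on the target table. The idea is to build one parameterized upsert 
statement per database flavor, bind each row to it, and call `executeBatch()` 
once per batch instead of sending one statement per row.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of Spark. */
class UpsertSketch {

    /** Returns "?, ?, ?" with one placeholder per column. */
    static String placeholders(int n) {
        return String.join(", ", Collections.nCopies(n, "?"));
    }

    /** PostgreSQL flavor: INSERT ... ON CONFLICT (keys) DO UPDATE SET ... */
    static String postgresUpsert(String table, List<String> cols, List<String> keys) {
        // Only non-key columns are updated when the key already exists.
        String updates = cols.stream()
            .filter(c -> !keys.contains(c))
            .map(c -> c + " = EXCLUDED." + c)
            .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + String.join(", ", cols) + ")"
            + " VALUES (" + placeholders(cols.size()) + ")"
            + " ON CONFLICT (" + String.join(", ", keys) + ")"
            + " DO UPDATE SET " + updates;
    }

    /** MySQL flavor: INSERT ... ON DUPLICATE KEY UPDATE ... */
    static String mysqlUpsert(String table, List<String> cols, List<String> keys) {
        String updates = cols.stream()
            .filter(c -> !keys.contains(c))
            .map(c -> c + " = VALUES(" + c + ")")
            .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + String.join(", ", cols) + ")"
            + " VALUES (" + placeholders(cols.size()) + ")"
            + " ON DUPLICATE KEY UPDATE " + updates;
    }

    /** Binds each row to the same prepared statement and executes once per batch. */
    static void executeBatched(Connection conn, String sql, List<Object[]> rows, int batchSize)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            int pending = 0;
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    stmt.setObject(i + 1, row[i]); // JDBC parameters are 1-based
                }
                stmt.addBatch();
                if (++pending == batchSize) {
                    stmt.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch(); // flush the last, partially filled batch
            }
        }
    }
}
```

    With something along these lines, the database decides per row whether to 
insert or update, so no join against the target table is needed on the Spark 
side. The conflict target still has to match a real unique constraint on the 
table, so the concern in point 2 remains.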

