GitHub user xwu0226 commented on the issue:

    https://github.com/apache/spark/pull/16685

A few comments:

1. My major concern is that this solution needs to pull in all of the target table's data and join the source DataFrame against it to determine which rows to update and which to insert. I am worried that this join by itself adds a lot of performance overhead to the upsert operation. In addition, the target table may have changed significantly while this decision is being made, which would make the insert/update decisions stale.

2. The primary key set provided may not exactly match the unique constraints on the target table. Inserts or updates can then fail, because some columns that are part of a unique constraint may be outside the provided primary key set.

3. The inserts are executed as a batch containing as many statements as there are rows to insert, and the same holds for updates, so many statements have to be passed to the target database via JDBC. Would it perform better if the column values were bound to host variables in a prepared statement for batch-size rows and executed once per batch?

4. Most database systems provide an UPSERT capability, such as `INSERT ... ON DUPLICATE KEY UPDATE` in MySQL, `INSERT ... ON CONFLICT ... DO UPDATE SET` in PostgreSQL, and the MERGE statement in DB2, Oracle, etc., where the database decides whether to insert or update. Maybe we can take advantage of this by extending the individual JDBCDialect implementations?

PR https://github.com/apache/spark/pull/16692 actually minimizes the issues above. Please take a look and compare.
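To make points 3 and 4 concrete, here is a rough sketch of the idea: build one parameterized upsert statement per dialect and execute it once per batch of rows. This is only an illustration and not code from either PR. It uses Python's built-in sqlite3 module as a stand-in for a JDBC connection, and the function names (`build_upsert_sql`, `batches`, `upsert_in_batches`, `demo`) and the dialect labels are hypothetical, not part of Spark's dialect API. It assumes at least one non-key column, does no identifier quoting, and needs SQLite 3.24 or later, whose `ON CONFLICT` clause follows the PostgreSQL syntax.

```python
import sqlite3
from itertools import islice


def build_upsert_sql(dialect, table, columns, key_columns):
    """Return one parameterized upsert statement for a (hypothetical) dialect label."""
    non_keys = [c for c in columns if c not in key_columns]
    if not non_keys:
        raise ValueError("this sketch assumes at least one non-key column")
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    insert = f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"
    if dialect == "mysql":
        # MySQL form; VALUES(col) refers to the value that would have been inserted.
        updates = ", ".join(f"{c} = VALUES({c})" for c in non_keys)
        return f"{insert} ON DUPLICATE KEY UPDATE {updates}"
    if dialect in ("postgresql", "sqlite"):
        # PostgreSQL form; EXCLUDED refers to the row proposed for insertion.
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in non_keys)
        keys = ", ".join(key_columns)
        return f"{insert} ON CONFLICT ({keys}) DO UPDATE SET {updates}"
    raise ValueError(f"no upsert syntax known for dialect {dialect!r}")


def batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def upsert_in_batches(conn, sql, rows, batch_size):
    """Run the same statement once per batch, binding each row's values as parameters."""
    executed = 0
    for chunk in batches(rows, batch_size):
        conn.executemany(sql, chunk)  # one statement text, many bound rows
        conn.commit()
        executed += 1
    return executed


def demo():
    """Upsert three rows into an in-memory table that already holds id 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO target VALUES (1, 'old')")
    sql = build_upsert_sql("sqlite", "target", ["id", "name"], ["id"])
    upsert_in_batches(conn, sql, [(1, "new"), (2, "b"), (3, "c")], batch_size=2)
    return conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
```

With this shape the database decides per row whether to insert or update, so no join against the target table is needed on the Spark side, and the conflict target is the table's own unique constraint. In JDBC the equivalent of `executemany` would be a single `PreparedStatement` filled via `addBatch()` and flushed with `executeBatch()`.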