Github user xwu0226 commented on the issue:
https://github.com/apache/spark/pull/16685
A few comments:
1. The major concern is that this solution needs to pull in the whole target
table and join it with the source dataframe to determine which rows to update
and which to insert. I am worried that this join adds a lot of performance
overhead to the upsert operation. Also, while this decision is being made, the
target table may have changed significantly, which would invalidate the
insert/update decision.
2. The primary key set provided may not exactly match the unique constraints
on the target table. Inserts or updates can then fail, because some columns
that are part of a unique constraint may be outside the provided primary key
set.
3. The inserts are executed as a batch containing one statement per inserted
row, and the same goes for updates, so many statements have to be passed via
JDBC to the target database. Would it perform better to bind the column values
to host variables in a prepared statement for batch-size rows and execute once
per batch?
4. Most database systems provide UPSERT capability, such as
`INSERT ... ON DUPLICATE KEY UPDATE` in MySQL,
`INSERT ... ON CONFLICT ... DO UPDATE SET` in PostgreSQL, and the MERGE
statement in DB2, Oracle, etc., where the database decides whether to insert
or update. Maybe we can take advantage of this by extending the different
JDBCDialect implementations?
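
To illustrate point 4, here is a minimal sketch of generating one parameterized upsert statement per dialect. It is written in Python only for brevity. The function name `build_upsert_sql`, its parameters, and the dialect strings are hypothetical and are not part of Spark's API; only the MySQL and PostgreSQL syntax is covered, and MERGE is left out.

```python
def build_upsert_sql(dialect, table, columns, key_columns):
    """Return one parameterized upsert statement for the given dialect.

    Hypothetical helper for illustration; not Spark's JDBCDialect API.
    """
    non_keys = [c for c in columns if c not in key_columns]
    if not non_keys:
        raise ValueError("need at least one non-key column to update")

    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)  # JDBC-style markers
    insert = f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"

    if dialect == "mysql":
        # MySQL: VALUES(col) refers to the value proposed for insertion.
        updates = ", ".join(f"{c} = VALUES({c})" for c in non_keys)
        return f"{insert} ON DUPLICATE KEY UPDATE {updates}"
    if dialect == "postgresql":
        # PostgreSQL: EXCLUDED.col refers to the row proposed for insertion.
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in non_keys)
        keys = ", ".join(key_columns)
        return f"{insert} ON CONFLICT ({keys}) DO UPDATE SET {updates}"
    raise ValueError(f"no upsert syntax known for dialect: {dialect}")


mysql_sql = build_upsert_sql("mysql", "t", ["id", "v"], ["id"])
postgres_sql = build_upsert_sql("postgresql", "t", ["id", "v"], ["id"])
```

With this shape the database decides between insert and update, so no join against the target table is needed on the Spark side.
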
PR https://github.com/apache/spark/pull/16692 minimizes the issues above.
Please take a look and compare.
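
The batching idea in point 3 can be sketched as follows. This is only an illustration using Python's built-in `sqlite3` module in place of JDBC; the table `target`, the helper `insert_in_batches`, and the batch size are assumptions made up for the example. One statement with bound parameters is reused for every row, and it is sent once per batch rather than once per row.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size):
    """Insert (id, value) rows, one parameterized batch at a time.

    Returns the number of batches sent. Hypothetical helper for illustration.
    """
    sql = "INSERT INTO target (id, value) VALUES (?, ?)"
    batches = 0
    for start in range(0, len(rows), batch_size):
        # One statement text, many parameter tuples, one call per batch.
        conn.executemany(sql, rows[start:start + batch_size])
        batches += 1
    conn.commit()
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
batches = insert_in_batches(conn, [(1, "a"), (2, "b"), (3, "c")], batch_size=2)
result = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
```

In JDBC terms, this would correspond to a single `PreparedStatement` with `addBatch()` per row and `executeBatch()` per batch.
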