GitHub user ilganeli commented on the issue:
https://github.com/apache/spark/pull/16685
@xwu0226 Thanks for the comments. I've reviewed your submission and
commented at https://github.com/apache/spark/pull/16692.
In response to your specific points:
1) We did not find the join to be a limiting factor in our tests. Granted,
this is very dataset-specific, but conceptually Spark handles distributed joins
very effectively, and extracting the data from the database is an O(n)
operation. The main cost of this approach is the additional copy of data out of
the database and then back in as an INSERT + UPDATE. However, an UPSERT
operation is equivalent to a DELETE and INSERT operation. There may be a close
race between copy-out/INSERT/UPDATE and UPSERT, but I'm not convinced this step
carries a significant performance penalty, particularly given the high cost of
enforcing the uniqueness constraint for UPSERT.
2) This is indeed a valid concern. This approach requires the Spark
programmer to enforce and maintain the uniqueness constraints on the table,
rather than relying on the database to enforce them. This is a conceptual shift
from how things are usually implemented (where the DB Admin is king), but in
our case the choice was justified by massive performance improvements.
3) I agree that using PreparedStatement would be better. I tried it initially
and ran into issues with certain data types (particularly timestamps). I
haven't yet tried it with wildcard placeholders, the way the insert statement
in JdbcUtils is currently implemented, but I think it's definitely doable that
way. It might also help boost performance.
4) I like the approach that you guys took to expand JDBCDialect in
https://github.com/apache/spark/pull/16692. It's a well-modularized approach.
I agree that something similar could be done here.
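
As a rough illustration of points 1 and 3 (not the code in this PR), the
sketch below splits incoming rows into an INSERT batch and an UPDATE batch
based on which keys already exist, and builds both statements with "?"
wildcards suitable for a PreparedStatement. Everything named here is
hypothetical: the class UpsertSketch, its methods, and the table and column
names. In Spark the split would come from a join against keys read out of the
database rather than from an in-memory set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical sketch: none of these names come from the PR or from Spark.
class UpsertSketch {

    // Holds the two batches produced by the split.
    static final class Split {
        final List<Map<String, Object>> inserts = new ArrayList<>();
        final List<Map<String, Object>> updates = new ArrayList<>();
    }

    // Rows whose key is already in the table go to UPDATE, the rest to INSERT.
    // Assumes the caller has already made the incoming keys unique (point 2).
    static Split split(List<Map<String, Object>> incoming,
                       Set<Object> existingKeys,
                       String keyColumn) {
        Split result = new Split();
        for (Map<String, Object> row : incoming) {
            if (existingKeys.contains(row.get(keyColumn))) {
                result.updates.add(row);
            } else {
                result.inserts.add(row);
            }
        }
        return result;
    }

    // INSERT with one "?" wildcard per column.
    static String insertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // UPDATE that sets every non-key column and matches on the key column.
    static String updateSql(String table, List<String> columns, String keyColumn) {
        StringJoiner assignments = new StringJoiner(", ");
        for (String column : columns) {
            if (!column.equals(keyColumn)) {
                assignments.add(column + " = ?");
            }
        }
        return "UPDATE " + table + " SET " + assignments
                + " WHERE " + keyColumn + " = ?";
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("id", "name", "updated_at");
        System.out.println(insertSql("people", columns));
        System.out.println(updateSql("people", columns, "id"));
    }
}
```

With statements in this form, values would be bound through the typed
PreparedStatement setters (for example setTimestamp for timestamp columns)
instead of being formatted into the SQL text, which may be one way around the
data type issues mentioned in point 3.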