GitHub user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16692
  
    Hi, all - thanks for this submission. Overall it's a very clean 
implementation and I like it a lot. There's obviously a large amount of effort 
that went into developing this. The main issue with this approach, however, is 
that the upsert statement itself is an extremely expensive operation. Depending 
on how your uniqueness condition is defined, validating against the uniqueness 
constraint proves to be the most expensive part of the whole sequence. In 
https://github.com/apache/spark/pull/16685 I chose instead to read in the 
existing table and use a join to identify conflicts, because a join is easily 
distributed across the entire dataset (sketched below).
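    
    For reference, here is a minimal sketch of that join-based conflict 
detection. The function name, table name, key columns, and connection details 
are placeholders for illustration, not the actual code in that PR:
    
    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    
    // Hypothetical sketch: split an incoming DataFrame into rows that
    // conflict with existing keys and rows that can be inserted directly.
    def splitByConflict(
        spark: SparkSession,
        incoming: DataFrame,
        keyCols: Seq[String]): (DataFrame, DataFrame) = {
    
      // Read only the key columns of the existing table; the joins below
      // run distributed across the cluster instead of being pushed down
      // to the database's unique index.
      val existingKeys = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://host/db") // placeholder
        .option("dbtable", "target_table")          // placeholder
        .load()
        .select(keyCols.head, keyCols.tail: _*)
    
      // Rows whose keys already exist need an UPDATE (or get dropped);
      // everything else is safe to INSERT.
      val conflicts = incoming.join(existingKeys, keyCols, "left_semi")
      val inserts   = incoming.join(existingKeys, keyCols, "left_anti")
      (conflicts, inserts)
    }
    ```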
    
    In contrast, the implementation as it stands in this PR ultimately depends 
entirely on the database to enforce the uniqueness constraint, a check that 
can ONLY be executed serially and requires a traversal of the index created on 
the uniqueness constraint for every row. Furthermore, in both MySQL and 
Postgres (the examples you've provided), this index cannot be a hash index. 
Unless the database owner manually computes and enforces hashes on individual 
rows, this approach relies on btree indices to perform the lookup.
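    
    Concretely, this is roughly the shape of the per-row statement the 
database has to validate (illustrative only; the table, columns, and exact SQL 
this PR generates may differ):
    
    ```scala
    import java.sql.DriverManager
    
    // Hypothetical per-row upsert path. Connection details, table, and
    // columns are placeholders.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://host/db", "user", "pass")
    
    // Postgres flavor; MySQL would use INSERT ... ON DUPLICATE KEY UPDATE.
    val upsert = conn.prepareStatement(
      """INSERT INTO target_table (id, value)
        |VALUES (?, ?)
        |ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value""".stripMargin)
    
    // Every execution forces the database to probe the unique index
    // (a btree in both Postgres and MySQL/InnoDB) before it can choose
    // between the INSERT and the UPDATE branch.
    upsert.setLong(1, 42L)
    upsert.setString(2, "example")
    upsert.executeUpdate()
    
    upsert.close()
    conn.close()
    ```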
    
    This is a marginal cost when the btree is on a single field, but if the 
uniqueness constraint spans multiple columns, the index becomes a composite 
btree over all of the constrained columns. Maintaining it, in turn, proves to 
be an extraordinarily costly update, with non-linear performance degradation 
as both the size of the database and the size of the table being upserted 
increase.
    
    This mirrors our initial approach to the problem, but we ultimately moved 
away from it in favor of the one in 
https://github.com/apache/spark/pull/16685 for performance reasons. We were 
able to achieve a more than 10x performance increase, even taking into account 
the cost of the additional joins. Our tests were not massive: we tested 
against a roughly 10 GB Postgres database with approximately 10 million rows, 
on a relatively mid-range machine. I would love to know if you have done any 
performance benchmarks with this approach, and whether you could try out the 
one in https://github.com/apache/spark/pull/16685 and let me know how it 
performs. Thanks!

