Github user ilganeli commented on the issue:
https://github.com/apache/spark/pull/16692
Hi, all - thanks for this submission. Overall it's a very clean
implementation and I like it a lot. There's obviously a large amount of effort
that went into developing this. The main issue with this approach, however, is
that the upsert statement itself is an extremely expensive operation. Depending
on how your uniqueness condition is defined, validating against the uniqueness
constraint proves to be the most expensive part of this whole sequence. In
https://github.com/apache/spark/pull/16685 I chose to implement this by reading
in the existing table and doing a join operation to identify conflicts. The
reason for this is that the join is easily distributed across the entire
dataset.
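To be clear, this is not the code from #16685 - just a minimal sketch of the
join-based split it describes, with hypothetical table and column names:

```scala
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SparkSession}

object UpsertViaJoinSketch {
  // Hypothetical key columns backing the uniqueness constraint.
  val keyCols = Seq("id", "region")

  def upsert(spark: SparkSession, incoming: DataFrame,
             url: String, table: String, props: Properties): Unit = {
    // Pull back only the key columns of the existing table.
    val existingKeys = spark.read.jdbc(url, table, props).selectExpr(keyCols: _*)

    // Keys not yet in the table -> plain, fully parallel INSERTs (append).
    val toInsert = incoming.join(existingKeys, keyCols, "left_anti")

    // Keys already present -> rows that need UPDATE statements, issued per
    // partition over plain JDBC (omitted in this sketch).
    val toUpdate = incoming.join(existingKeys, keyCols, "left_semi")

    toInsert.write.mode("append").jdbc(url, table, props)
    // toUpdate.foreachPartition { rows => /* run parameterized UPDATEs */ }
  }
}
```

The conflict detection happens in the join, which Spark distributes across the
cluster, rather than in the database's constraint check.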
In contrast, the implementation as it stands in this PR ultimately depends
entirely on the database to enforce the uniqueness constraint, a check that can
ONLY be executed serially and requires traversing the index built on the
uniqueness constraint for every row written. Furthermore, in both MySQL and
Postgres (the examples you've provided), this index cannot be a hash index:
unless the owner of the database manually computes and enforces hashes on
individual rows, the approach relies on B-tree indices to do this lookup.
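For concreteness, the statements being discussed look roughly like the
following (hypothetical table and columns; the SQL this PR actually generates
may differ). Each execution forces the database to probe the unique index
before deciding between insert and update:

```scala
// Illustrative only -- not the SQL generated by this PR.
val postgresUpsert =
  """INSERT INTO target (id, region, value)
    |VALUES (?, ?, ?)
    |ON CONFLICT (id, region) DO UPDATE SET value = EXCLUDED.value""".stripMargin

val mysqlUpsert =
  """INSERT INTO target (id, region, value)
    |VALUES (?, ?, ?)
    |ON DUPLICATE KEY UPDATE value = VALUES(value)""".stripMargin
```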
This is a marginal cost when the B-tree is on a single field, but if the
uniqueness constraint spans multiple columns, the index becomes a composite
B-tree over all of the key columns. This, in turn, makes the upsert
extraordinarily costly, with non-linear performance degradation as both the
size of the database and the size of the dataset being upserted increase.
This mirrors our initial approach to the problem, but we ultimately moved
away from it in favor of the one in
https://github.com/apache/spark/pull/16685 for performance reasons. We were
able to achieve a more than 10x performance increase, even taking into account
the cost of the additional joins. Our tests were not massive - roughly a 10 GB
Postgres database with approximately 10 million rows, on a fairly mid-range
machine. I would love to know whether you've done any performance benchmarks
with this approach, and whether you could try out the approach in
https://github.com/apache/spark/pull/16685 and let me know how it performs.
Thanks!