Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/16685
Currently, I do not have a solution for supporting parallel mass
UPDATE, because the rows in the DataFrame might be out of order and a global
transaction is missing. The solution posted in this PR makes many assumptions.
If users need to do it, the currently suggested workaround is to let Spark SQL
insert the results into a table and use a separate RDBMS application to do the
update (outside Spark SQL).
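A rough sketch of that staging-table workaround, with SQLite standing in for the RDBMS and hypothetical `staging`/`target` tables and column names (in practice Spark SQL would populate the staging table via JDBC):

```python
import sqlite3

# In-memory database stands in for the real RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)",
                [(1, "old"), (2, "old")])
# Pretend Spark SQL wrote its results here via JDBC.
cur.executemany("INSERT INTO staging VALUES (?, ?)",
                [(2, "new"), (3, "new")])

# A separate RDBMS-side step applies the mass update in a single
# transaction, so the ordering of the DataFrame rows no longer matters.
cur.execute("""
    UPDATE target
    SET val = (SELECT s.val FROM staging s WHERE s.id = target.id)
    WHERE id IN (SELECT id FROM staging)
""")
conn.commit()
print(cur.execute("SELECT id, val FROM target ORDER BY id").fetchall())
# [(1, 'old'), (2, 'new')]
```

The key point is that the update itself runs inside the database's own transaction, not across Spark tasks.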
I fully understand the challenges. I can share a solution I worked on in the
database replication area: https://www.google.com/patents/US20050193041
Although this patent still has a hole, it generally explains how to do it. In
that use case, we can do parallel update/insert/delete by using the
maintained transaction dependencies and retry logic with spill queues.
Unfortunately, it is not applicable to Spark SQL.
`UPSERT` is pretty useful to Spark SQL users. I prefer using the
capability provided by the RDBMS directly, instead of implementing it in Spark SQL.
Then, we can avoid fetching/joining the data from the JDBC tables. More
importantly, we can ensure each individual UPSERT works correctly even if the
target tables are being inserted into/updated by other applications at the same time.
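A minimal sketch of what delegating the UPSERT to the database looks like, using SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available since SQLite 3.24) as a stand-in for the vendor-specific syntax; PostgreSQL has the same clause and MySQL has `ON DUPLICATE KEY UPDATE`. Each statement is atomic, so concurrent writers are resolved by the database itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("INSERT INTO target VALUES (1, 'old')")

# Each row emitted by a partition could be upserted like this; the
# database, not Spark, decides insert-vs-update atomically per row.
rows = [(1, "updated"), (2, "inserted")]
cur.executemany(
    "INSERT INTO target (id, val) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET val = excluded.val",
    rows,
)
conn.commit()
print(cur.execute("SELECT id, val FROM target ORDER BY id").fetchall())
# [(1, 'updated'), (2, 'inserted')]
```

In a Spark job, the analogous statement would be issued per row (or per batch) inside `foreachPartition` over a JDBC connection, avoiding any need to first fetch and join the target table.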