Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Currently, I do not have a solution for supporting parallel mass
UPDATE, because the rows in the DataFrame might be out of order and a global
transaction is missing. The solution posted in this PR makes many assumptions.
If users need to do it, the currently suggested workaround is to let Spark SQL
insert the results into a table and use a separate RDBMS application to do the
update (outside Spark SQL), as sketched below.
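
    For concreteness, here is a minimal sketch of that workaround. Everything
in it is illustrative: the names `stageAndMerge`, `staging_updates`, `target`,
and the `(id, value)` schema are hypothetical, the `MERGE` syntax varies by
vendor, and the merge is triggered from the driver over plain JDBC, though it
could equally be a separate application or scheduled job.

    ```scala
    import java.sql.DriverManager

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Hypothetical names: `staging_updates`, `target`, and the (id, value)
    // schema are placeholders for illustration only.
    def stageAndMerge(updatesDF: DataFrame, jdbcUrl: String,
                      user: String, password: String): Unit = {
      // Step 1: let Spark SQL bulk-insert the results into a staging table.
      updatesDF.write
        .format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", "staging_updates")
        .option("user", user)
        .option("password", password)
        .mode(SaveMode.Overwrite)
        .save()

      // Step 2: apply the update outside Spark in a single transaction,
      // using the RDBMS's own MERGE support (ANSI-style syntax shown here).
      val conn = DriverManager.getConnection(jdbcUrl, user, password)
      try {
        conn.setAutoCommit(false)
        conn.createStatement().executeUpdate(
          """MERGE INTO target t
            |USING staging_updates s ON (t.id = s.id)
            |WHEN MATCHED THEN UPDATE SET t.value = s.value
            |WHEN NOT MATCHED THEN INSERT (id, value)
            |  VALUES (s.id, s.value)""".stripMargin)
        conn.commit()
      } finally {
        conn.close()
      }
    }
    ```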
    
    I fully understand the challenges. I can share a solution I developed in
the database replication area: https://www.google.com/patents/US20050193041
Although this patent still has a hole, it generally explains how to do it. In
that use case, we can do parallel update/insert/delete by using the
maintained transaction dependencies and retry logic with spill queues.
Unfortunately, it is not applicable to Spark SQL.
    
    `UPSERT` is pretty useful to Spark SQL users. I prefer using the
capability provided by the RDBMS directly, instead of implementing it in Spark
SQL. Then, we can avoid fetching/joining the data from the JDBC tables. More
importantly, we can ensure each individual UPSERT works correctly even if the
target tables are being inserted into or updated by other applications at the
same time. A sketch of this approach follows.
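
    A minimal sketch of what that could look like from the Spark side,
assuming a PostgreSQL target whose native `INSERT ... ON CONFLICT` provides
the upsert. The table and column names are hypothetical, and the statement
would need to be adapted to other vendors' upsert syntax.

    ```scala
    import java.sql.DriverManager

    import org.apache.spark.sql.{DataFrame, Row}

    // Assumes a PostgreSQL target with a unique key on `id`; `target`,
    // `id`, and `value` are placeholder names. Each row becomes one
    // vendor-native upsert, so correctness does not depend on row order
    // and holds even while other applications write to the same table.
    def upsertViaRdbms(df: DataFrame, jdbcUrl: String,
                       user: String, password: String): Unit = {
      df.foreachPartition { rows: Iterator[Row] =>
        // One connection per partition, reused for all of its rows.
        val conn = DriverManager.getConnection(jdbcUrl, user, password)
        try {
          val ps = conn.prepareStatement(
            """INSERT INTO target (id, value) VALUES (?, ?)
              |ON CONFLICT (id) DO UPDATE
              |  SET value = EXCLUDED.value""".stripMargin)
          rows.foreach { row =>
            ps.setLong(1, row.getAs[Long]("id"))
            ps.setString(2, row.getAs[String]("value"))
            ps.executeUpdate()  // the RDBMS resolves insert-vs-update atomically
          }
        } finally {
          conn.close()
        }
      }
    }
    ```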

