rdblue commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-529036736

@xianyinxin, can you describe the use case you have for adding this to Spark?

We discussed this PR in the last DSv2 sync; sorry that you weren't able to join us, that probably would have cleared up our questions more easily. We ended up with a few questions about this because it seems like this PR is adding source push-down for an operation before adding the operation itself.

What I mean is that Spark doesn't currently support this kind of update. To add Spark support for `UPDATE`, I would expect that we would add an implementation that reads the rows that might match the `WHERE` condition, finds all the rows that actually match, updates those rows, and saves the changed rows back to the data source. Since Spark is a distributed SQL engine, I would expect all of that to happen in parallel on executors.

But some sources, like JDBC, don't need Spark to handle the operation because they can push it down to the underlying store using an API like the one you're proposing.

It seems strange to me to add the push-down API first, when it would be used to proxy `UPDATE` calls to data stores that already support those operations. I know that other stores could also implement the `UPDATE` call, but a file-based store would need to do all of the reading, processing, and writing on the driver, where this call is made. That's a surprising implementation given that Spark is intended for distributed SQL operations.

I know there is some precedent for this with the `DELETE FROM` API, but that plan is different in a critical way: deleting data doesn't require processing individual rows. Some deletes are metadata-only operations, and that's why they can be handled by a single node.

I think it would really help me understand this PR if you could describe the source you're implementing and why this operation makes sense. Thanks!
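
To make the contrast above concrete, here is a minimal sketch in plain Python (not Spark code) of a row-level update versus a metadata-only delete. Everything in it is an assumption for illustration: the names `update_partition`, `distributed_update`, and `metadata_only_delete` are hypothetical, and in-memory lists and dicts stand in for data source partitions and files.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Partition = List[Row]


def update_partition(
    rows: Partition,
    predicate: Callable[[Row], bool],
    assignments: Row,
) -> Tuple[Partition, int]:
    """Per-partition task: read every candidate row, rewrite the ones that match."""
    out: Partition = []
    changed = 0
    for row in rows:
        if predicate(row):
            # Build a new row so the input partition is left untouched.
            out.append({**row, **assignments})
            changed += 1
        else:
            out.append(row)
    return out, changed


def distributed_update(
    partitions: List[Partition],
    predicate: Callable[[Row], bool],
    assignments: Row,
) -> Tuple[List[Partition], int]:
    """Apply the update to each partition independently.

    The loop is a stand-in: because no partition depends on another, an
    engine could run each call on a different executor.
    """
    results = [update_partition(p, predicate, assignments) for p in partitions]
    return [rows for rows, _ in results], sum(n for _, n in results)


def metadata_only_delete(
    files_by_partition: Dict[str, Partition], partition_key: str
) -> Dict[str, Partition]:
    """Drop a whole partition by name; no row is ever read or rewritten."""
    return {k: v for k, v in files_by_partition.items() if k != partition_key}
```

The update has to touch every candidate row, which is the work one would expect to spread across executors, while the delete only edits the listing of partitions, which a single node can do cheaply.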
