[GitHub] [spark] rdblue commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for DataSource V2

GitBox Fri, 06 Sep 2019 15:39:29 -0700

rdblue commented on issue #25626: [SPARK-28892][SQL] Add UPDATE support for 
DataSource V2
URL: https://github.com/apache/spark/pull/25626#issuecomment-529036736
 
 
   @xianyinxin, can you describe the use case you have for adding this to Spark?
   
   We discussed this PR in the last DSv2 sync. Sorry that you weren't able to 
join us; that probably would have cleared up our questions more easily. We 
ended up with a few questions because it seems like this PR is adding source 
push-down for an operation before adding the operation itself.
   
   What I mean is that Spark doesn't currently support this kind of update. To 
add Spark support for `UPDATE`, I would expect that we would add an 
implementation that reads the rows that might match the `WHERE` clause, finds 
all the rows that actually match, updates those rows, and saves the changed 
rows back to the data source. Since Spark is a distributed SQL engine, I would 
expect all of that to happen in parallel on executors. But some sources, like 
JDBC, don't need Spark to handle the operation because they can push it down to 
the underlying store using an API like the one you're proposing.
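   
   To make the contrast concrete, here is a rough sketch of the two paths. It 
is plain Python, not Spark code: the partition layout, the function names, the 
`events` table, and the use of threads to stand in for executors are all 
assumptions for illustration, and SQLite stands in for a store that already 
supports `UPDATE`.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def update_partition(rows, predicate, assignments):
    """Rewrite one partition: apply `assignments` to rows matching `predicate`."""
    updated, changed = [], 0
    for row in rows:
        if predicate(row):
            # Build a new row so the input partition is left untouched.
            updated.append({**row, **assignments})
            changed += 1
        else:
            updated.append(row)
    return updated, changed


def engine_side_update(partitions, predicate, assignments, workers=4):
    """Engine-driven path: read, match, update, and write back per partition.

    Threads stand in for executors here (an assumption for this sketch);
    each partition is processed independently of the others.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda part: update_partition(part, predicate, assignments),
            partitions,
        ))
    new_partitions = [rows for rows, _ in results]
    return new_partitions, sum(count for _, count in results)


def pushed_down_update(conn, new_status, min_id):
    """Push-down path: the store runs the UPDATE; no rows pass through the engine."""
    cursor = conn.execute(
        "UPDATE events SET status = ? WHERE id >= ?", (new_status, min_id)
    )
    conn.commit()
    return cursor.rowcount  # number of rows the store reports as modified
```

   In the first path the engine does the per-row work and can spread it across 
workers. In the second the caller only forwards the statement, which is cheap 
for a database but leaves a file-based source to do everything in one place.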
   
   It seems strange to me to add the push-down API first, when it would be used 
to proxy UPDATE calls to data stores that already support those operations. I 
know that other stores could also implement the UPDATE call, but a file-based 
store would need to do all of the reading, processing, and writing on the 
driver where this call is made. That's a surprising implementation given that 
Spark is intended for distributed SQL operations.
   
   I know there is some precedent for this with the `DELETE FROM` API, but that 
plan is different in a critical way: deleting data doesn't require processing 
individual rows. Some deletes are metadata-only operations and that's why they 
can be handled by a single node.
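   
   As an illustration of what "metadata-only" means here, consider a 
hypothetical file-based table that tracks its data files in a manifest, keyed 
by partition value. The `DataFile` type and `metadata_only_delete` function 
below are made up for this sketch and are not part of any Spark API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str  # partition value shared by every row in the file


def metadata_only_delete(manifest, partition_value):
    """Delete all rows in one partition by dropping its files from the manifest.

    No data file is opened and no row is inspected, which is why a single
    node can handle this kind of delete.
    """
    kept = [f for f in manifest if f.partition != partition_value]
    return kept, len(manifest) - len(kept)
```

   An update has no equivalent shortcut: changing a column value means 
reading the affected rows and writing new ones.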
   
   I think it would really help me understand this PR if you could describe the 
source you're implementing and why this operation makes sense. Thanks!


