aokolnychyi commented on a change in pull request #30562:
URL: https://github.com/apache/spark/pull/30562#discussion_r533990021



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsDelete.java
##########
@@ -28,6 +28,25 @@
  */
 @Evolving
 public interface SupportsDelete {
+
+  /**
+   * Checks whether it is possible to delete data from a data source table that matches filter
+   * expressions.
+   * <p>
+   * Rows should be deleted from the data source iff all of the filter expressions match.
+   * That is, the expressions must be interpreted as a set of filters that are ANDed together.
+   * <p>
+   * Spark will call this method to check if the delete is possible without significant effort.

Review comment:
   > So canDeleteWhere is a much light-weight approach to know deleteWhere will reject a delete operation without actually calling deleteWhere.
   
   Yes, @viirya. It gives us a way to check at planning time whether a data source is going to reject the delete via `deleteWhere`, instead of only finding out through an exception during execution. In the future, we can use this to decide whether Spark should rewrite the delete and execute a distributed query, or whether it can simply pass a set of filters to the data source.
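   As a rough illustration (not the actual planner code in this PR), the caller-side contract could look like the sketch below; `DeletePlanningSketch` and `planDelete` are made-up names used only to show where the planning-time check fits.
   
   ```java
   import org.apache.spark.sql.connector.catalog.SupportsDelete;
   import org.apache.spark.sql.sources.Filter;
   
   // Illustrative only: consult canDeleteWhere during planning so an unsupported
   // delete fails (or is rewritten) early instead of throwing from deleteWhere
   // at execution time.
   final class DeletePlanningSketch {
     static void planDelete(SupportsDelete table, Filter[] filters) {
       if (table.canDeleteWhere(filters)) {
         // Cheap path: the source can handle the delete with filters alone.
         table.deleteWhere(filters);
       } else {
         // Future work per this discussion: rewrite the delete as a distributed
         // query using a row-level API instead of failing here.
         throw new UnsupportedOperationException(
             "Delete cannot be handled by the source with filters only");
       }
     }
   }
   ```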
   
   Consider a partitioned Hive table. If the delete predicate is something like `part_col = '2020'`, we can satisfy the delete by dropping the matching partition. In that case, the data source should return `true` from `canDeleteWhere` and use the filters it accepts in `deleteWhere` to drop the partition. I consider this a delete without significant effort. On the other hand, if the delete predicate is something like `id = 10`, a Hive table cannot execute the delete as a metadata-only operation without rewriting files. In that case, the data source should return `false` from `canDeleteWhere`, and we will need a more sophisticated row-level API to find out which records must be removed (that API is yet to be discussed, but we need this PR as a basis).
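   To make this concrete, here is a minimal, hypothetical sketch of a partition-aware source. The class and helpers (`PartitionedHiveLikeTable`, `partitionColumns`, `dropPartition`) are made up for illustration; only `SupportsDelete`, `Filter`, and `EqualTo` come from the Spark API, and a real implementation would also implement `Table` and the relevant read/write mix-ins.
   
   ```java
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.Set;
   
   import org.apache.spark.sql.connector.catalog.SupportsDelete;
   import org.apache.spark.sql.sources.EqualTo;
   import org.apache.spark.sql.sources.Filter;
   
   // Hypothetical partitioned table: metadata-only deletes are possible only when
   // every filter is an equality predicate on a partition column.
   class PartitionedHiveLikeTable implements SupportsDelete {
   
     private final Set<String> partitionColumns = Collections.singleton("part_col");
   
     @Override
     public boolean canDeleteWhere(Filter[] filters) {
       // `part_col = '2020'` -> true (drop the matching partition),
       // `id = 10`           -> false (would require rewriting data files).
       return Arrays.stream(filters).allMatch(
           f -> f instanceof EqualTo && partitionColumns.contains(((EqualTo) f).attribute()));
     }
   
     @Override
     public void deleteWhere(Filter[] filters) {
       // Only called if canDeleteWhere returned true, so every filter maps to a partition.
       for (Filter f : filters) {
         EqualTo eq = (EqualTo) f;
         dropPartition(eq.attribute(), eq.value());
       }
     }
   
     private void dropPartition(String column, Object value) {
       // Placeholder for the metastore call that drops the matching partition.
     }
   }
   ```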
   
   If we decide to support subqueries and all delete use cases by simply extending the existing API, every data source will have to implement a lot of Spark logic to determine which records changed. I don't think we want to go that way, since the logic for determining which records should be deleted is independent of the underlying data source. So the assumption is that, for data sources that return `false` from `canDeleteWhere`, Spark will execute a plan to find the records that must be deleted.



