aokolnychyi commented on a change in pull request #30562:
URL: https://github.com/apache/spark/pull/30562#discussion_r533990021
##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsDelete.java
##########
@@ -28,6 +28,25 @@
*/
@Evolving
public interface SupportsDelete {
+
+ /**
+ * Checks whether it is possible to delete data from a data source table that matches filter
+ * expressions.
+ * <p>
+ * Rows should be deleted from the data source iff all of the filter expressions match.
+ * That is, the expressions must be interpreted as a set of filters that are ANDed together.
+ * <p>
+ * Spark will call this method to check if the delete is possible without significant effort.
Review comment:
> So canDeleteWhere is a much light-weight approach to know deleteWhere will reject a delete operation without actually calling deleteWhere.

Yes, @viirya. It gives us a way to find out at planning time whether a data source is going to reject the delete via `deleteWhere`, instead of only getting an exception during execution. In the future, we can use this to decide whether Spark should rewrite the delete and execute a distributed query, or whether it can simply pass a set of filters to the data source; see the sketch below.
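To make the intent concrete, here is a minimal, hypothetical planner-side sketch (the `DeletePlanningSketch` class, the `planDelete` helper, and its control flow are illustrative only, not Spark's actual planner code; only `canDeleteWhere` and `deleteWhere` come from the interface in this PR):

```java
// Hypothetical sketch: shows where canDeleteWhere fits relative to deleteWhere.
import org.apache.spark.sql.connector.catalog.SupportsDelete;
import org.apache.spark.sql.sources.Filter;

final class DeletePlanningSketch {
  static void planDelete(SupportsDelete table, Filter[] filters) {
    if (table.canDeleteWhere(filters)) {
      // The source can satisfy the delete with the pushed filters alone.
      table.deleteWhere(filters);
    } else {
      // Future direction discussed below: rewrite the delete as a
      // distributed query using a row-level API (not part of this PR).
      throw new UnsupportedOperationException(
          "row-level rewrite is not covered by this sketch");
    }
  }
}
```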
Consider a partitioned Hive table as an example. If we have a delete predicate like `part_col = '2020'`, we can simply drop the matching partition to satisfy the delete. In this case, the data source should return `true` from `canDeleteWhere` and use the filters it accepts in `deleteWhere` to drop the partition. At the same time, if we have a delete predicate like `id = 10`, a Hive table would not be able to execute the delete as a metadata-only operation without rewriting files. In that case, the data source should return `false` from `canDeleteWhere`, and we should use a more sophisticated row-level API to find out which records should be removed (that API is yet to be discussed, but we need this PR as a basis). A sketch of such a source follows below.
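As a rough illustration only, a partitioned source might implement the check like this. The `partitionColumns` and `dropMatchingPartitions` helpers are hypothetical, and real sources will have their own partition handling; only the `SupportsDelete` methods and the `Filter`/`EqualTo` classes are existing Spark APIs.

```java
import java.util.Set;
import org.apache.spark.sql.connector.catalog.SupportsDelete;
import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;

abstract class PartitionedTableSketch implements SupportsDelete {

  abstract Set<String> partitionColumns();           // hypothetical helper
  abstract void dropMatchingPartitions(Filter[] fs);  // hypothetical helper

  @Override
  public boolean canDeleteWhere(Filter[] filters) {
    // A metadata-only delete is possible iff every filter is an equality on a
    // partition column, e.g. part_col = '2020'. A filter like id = 10 would
    // require rewriting files, so we return false and let Spark plan a
    // row-level rewrite instead.
    for (Filter f : filters) {
      if (!(f instanceof EqualTo) ||
          !partitionColumns().contains(((EqualTo) f).attribute())) {
        return false;
      }
    }
    return true;
  }

  @Override
  public void deleteWhere(Filter[] filters) {
    dropMatchingPartitions(filters);
  }
}
```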
If we decided to support subqueries and all delete use cases by simply extending the existing API, every data source would have to reimplement a lot of Spark logic to determine which records changed. I don't think we want to go that way, as the logic for determining which records should be deleted is independent of the underlying data source. So the assumption is that, for data sources that return `false` from `canDeleteWhere`, Spark will execute a plan to find which records must be deleted.