aokolnychyi commented on PR #35395:
URL: https://github.com/apache/spark/pull/35395#issuecomment-1095579921

   > shall we apply filter pushdown twice for simple DELETE execution? e.g. we first push down the DELETE condition to identify the files we need to replace, then we push down the negated DELETE condition to prune the Parquet row groups.
   
   @cloud-fan, I think discarding entire row groups is possible only for DELETEs where the whole condition was successfully translated into data source filters. This isn't something we can support for other commands like UPDATE, or when parts of the condition can't be converted to a data source filter (e.g. a subquery).
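   To make that distinction concrete, here is a small illustration (the catalog and table names are made up, and it assumes a DSv2 catalog whose tables support row-level DELETE): the first condition below maps directly to data source filters, while the second contains a subquery and therefore has no data source filter representation.

```scala
import org.apache.spark.sql.SparkSession

object DeleteConditionExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("delete-condition-examples").getOrCreate()

    // Whole condition translates to data source filters: a source could, in principle,
    // drop every row group in which all rows satisfy `ts < TIMESTAMP '2020-01-01'`.
    spark.sql("DELETE FROM cat.db.events WHERE ts < TIMESTAMP '2020-01-01'")

    // Subquery condition: it cannot be converted to a data source filter, so the source
    // only learns which files are affected and cannot discard row groups wholesale.
    spark.sql("DELETE FROM cat.db.events WHERE id IN (SELECT id FROM cat.db.banned_ids)")

    spark.stop()
  }
}
```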
   
   A few points on my mind right now:
   - How will data sources know which condition is meant for filtering files and which is meant for filtering row groups without changes to the API?
   - Creating a scan builder in one rule and then configuring it further in another will make the main planning rule even more complicated than it is today.
   
   Technically, if we simply extend the scan builder API to indicate that the entire condition is being pushed down, that should be sufficient for data sources to discard entire row groups of records matching the DELETE condition. We already pass the SQL command and the condition; data sources just don't know whether it is the entire condition.
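   As a minimal sketch of what such an extension could look like (purely illustrative: the trait name and method below are hypothetical and not part of the Spark API), a scan builder mixing this in would be told, after pushdown, whether the pushed filters represent the entire command condition, which is exactly the signal a source needs before it can drop whole row groups of matching records.

```scala
import org.apache.spark.sql.sources.Filter

// Hypothetical mix-in for scan builders of row-level operations; NOT an existing Spark interface.
trait SupportsCompleteConditionPushdown {
  // Invoked by the planner once pushdown is done. `complete` is true only when the whole
  // DELETE/UPDATE/MERGE condition was successfully translated into `pushedFilters`;
  // a source may discard entire row groups of matching rows only in that case.
  def notifyConditionPushdown(pushedFilters: Array[Filter], complete: Boolean): Unit
}
```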

