aokolnychyi commented on PR #35395: URL: https://github.com/apache/spark/pull/35395#issuecomment-1095579921
> shall we apply filter pushdown twice for simple DELETE execution? e.g. we first pushdown the DELETE condition to identify the files we need to replace, then we pushdown the negated DELETE condition to prune the parquet row groups.

@cloud-fan, I think discarding entire row groups is possible only for DELETEs where the whole condition was successfully translated into data source filters. This isn't something we can support for other commands like UPDATE, or when some part of the condition can't be converted to a data source filter (e.g. a subquery).

A few points on my mind right now:
- How will data sources know which condition is for filtering files and which is for filtering row groups without changes to the API?
- Creating a scan builder in one rule and then configuring it further in another will make the main planning rule even more complicated than it is today.

Technically, if we simply extend the scan builder API to indicate that the entire condition is being pushed down, that should be sufficient for data sources to discard entire row groups of deleted records. We already pass the SQL command and the condition; data sources just don't know whether it is the entire condition.
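
To make the idea concrete, here is a minimal sketch of what such an extension might look like. This is only an illustration of the proposal, not the actual Spark API: the interface name `SupportsCompleteConditionPushdown` and the method `pushedConditionIsComplete` are hypothetical, and only `ScanBuilder` is a real Spark interface.

```java
// Hypothetical sketch only. The mixin name and method below are assumptions;
// they are not part of the existing DSv2 connector API.
import org.apache.spark.sql.connector.read.ScanBuilder;

public interface SupportsCompleteConditionPushdown extends ScanBuilder {

  /**
   * Hypothetically called by Spark after filter pushdown to signal whether the
   * pushed filters cover the entire DELETE condition. If true, a Parquet-backed
   * source could drop whole row groups whose records all match the condition
   * instead of rewriting the corresponding files.
   */
  void pushedConditionIsComplete(boolean complete);
}
```

A single flag like this would keep planning single-pass: sources that can act on it discard row groups, and everyone else simply ignores it.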
