adriangb commented on PR #15801: URL: https://github.com/apache/datafusion/pull/15801#issuecomment-2848840198
> However, the existing ExecutionPlan APIs are general-purpose, self-explanatory, and simple, whereas these new ones feel harder to understand and are quite specific to the current rule. I initially tried to add something like `filter(&self) -> Arc<dyn PhysicalExpr>` and `try_filter_pushdown(&self, filter: Arc<dyn PhysicalExpr>) -> Self`, but couldn't get it to pass all the tests. It seems theoretically possible, but I prefer not to spend more time on it at this stage.

I don't think these new APIs are any more specialized than something like `try_swapping_with_projection`, `supports_limit_pushdown`, or `cardinality_effect`. Yes, they're somewhat complex APIs, but unless someone sees a way to simplify them while still addressing all of the use cases, I don't think there's much to be done about that. In other words, the use case may warrant this level of complexity. For example, `HashJoinExec` needs to direct a different set of filters to each child, possibly modifying the projections and other bits, while also pushing down the filters it generated itself and communicating the result of pushdown to its parents. That is just a _complex_ thing to describe, and because of that I don't think it's even theoretically possible to implement via `filter(&self) -> Arc<dyn PhysicalExpr>` + `try_filter_pushdown(&self, filter: Arc<dyn PhysicalExpr>) -> Self`. In my opinion our best bet is to craft good helper structs and functions to operate on the state, which I've attempted to do here but which could certainly use further refinement.

> Related to the first item, I'm also a bit concerned that the two new APIs -- `gather_filters_for_pushdown` and `handle_child_pushdown_result` -- are both doing parts of the pushdown work. The first one analyzes, the second applies changes. This feels a bit confusing, as the responsibilities aren't clearly separated (I was always thinking of the APIs as: 1 -> information about having a filter, 2 -> how you treat these filters). At the least, we should write in detail what each method and parameter means (I couldn't put them into words :/ )

I don't follow what the confusing part is: I think of it the same way you are describing. One method takes in the parent filters and spits out a description of how to push those, plus any additional filters, down to each child. The second method handles the result of pushing down to the children. I do agree that more and better docstrings would help. (The first sketch at the end of this comment illustrates the two-method shape.)

> Another thing I couldn't fully understand is why we need to separate the analysis & pushdown work of parent filters vs. self filters. Could you give an example where keeping them separate is necessary?

I think it helps preserve the invariant that filters should only be pushed down, never up. Consider the case of a TopK node or `FilterExec`:

1. It receives N filters from its parents.
2. It adds its own filter and pushes everything down to its children.
3. When `handle_child_pushdown_result` gets called, it needs to split them back up: it must tell its parent the result of pushing down each of the parent filters, but it should *not* also include the result of, or a copy of, its own filter. That would cause confusion at the least, if not bugs from filters flowing up the execution plan.

Keeping them separate makes it obvious which filters you injected into your children vs. which ones came from your parents. (The second sketch below shows this splitting for a `FilterExec`-like node.)
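To make the two-phase shape concrete, here is a minimal, self-contained toy model in Rust. To be clear, everything in it is a hypothetical stand-in, not the actual DataFusion API: `Expr`, `PushdownResult`, `FilterPushdownPlan`, and the trait signatures are all simplified for illustration; only the method names echo the real `gather_filters_for_pushdown` / `handle_child_pushdown_result`.

```rust
// Toy model of the two-phase pushdown shape. All types here are
// simplified stand-ins: `Expr` plays the role of `Arc<dyn PhysicalExpr>`,
// and the signatures are invented for illustration only.

/// Stand-in for a physical filter expression.
type Expr = String;

/// What a child reports back for each filter pushed into it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PushdownResult {
    Exact,       // child fully applies the filter; parent may drop it
    Unsupported, // child cannot apply it; parent must keep filtering
}

/// Phase 1 output: which filters to route to each child. A multi-child
/// node like `HashJoinExec` can route a different set to each side.
struct FilterPushdownPlan {
    per_child: Vec<Vec<Expr>>,
}

trait PushdownNode {
    /// Phase 1 (analysis): given the filters arriving from the parent,
    /// describe how to route them -- plus any self-generated filters --
    /// to each child. No plan rewriting happens here.
    fn gather_filters_for_pushdown(&self, parent_filters: &[Expr]) -> FilterPushdownPlan;

    /// Phase 2 (application): given what the children reported back,
    /// rewrite self if needed and report, for each *parent* filter,
    /// whether it was handled.
    fn handle_child_pushdown_result(
        &self,
        child_results: &[Vec<PushdownResult>],
        parent_filters: &[Expr],
    ) -> Vec<PushdownResult>;
}
```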
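Continuing that toy model, here is how the parent-vs-self split from the TopK / `FilterExec` example above could look. It reuses the types from the previous sketch and assumes, purely for illustration, that a child reports results in the same order the filters were handed to it: parent filters first, the node's own filter last.

```rust
// A `FilterExec`-like node in the toy model above. The ordering of
// `child_results` (parent filters first, self filter last) is an
// assumption of this sketch, not a documented contract.

struct FilterNode {
    predicate: Expr, // the node's own filter
}

impl PushdownNode for FilterNode {
    fn gather_filters_for_pushdown(&self, parent_filters: &[Expr]) -> FilterPushdownPlan {
        // Push the parent's filters *and* our own predicate to our
        // single child: parent filters first, self filter last.
        let mut filters = parent_filters.to_vec();
        filters.push(self.predicate.clone());
        FilterPushdownPlan { per_child: vec![filters] }
    }

    fn handle_child_pushdown_result(
        &self,
        child_results: &[Vec<PushdownResult>],
        parent_filters: &[Expr],
    ) -> Vec<PushdownResult> {
        let results = &child_results[0];
        // Split the results back up: the first N entries correspond to
        // the N parent filters; the last entry is our own predicate.
        // Only the parent-filter results flow upward -- echoing our own
        // filter to the parent is exactly what would let filters "flow
        // up" the plan.
        let (parent_part, _self_part) = results.split_at(parent_filters.len());
        parent_part.to_vec()
    }
}

fn main() {
    let node = FilterNode { predicate: "c > 5".to_string() };
    let parent_filters = vec!["a = 1".to_string(), "b < 2".to_string()];

    let plan = node.gather_filters_for_pushdown(&parent_filters);
    assert_eq!(plan.per_child[0].len(), 3); // 2 parent filters + own predicate

    // Pretend the child absorbed the parent filters but not our predicate.
    let child_results = vec![vec![
        PushdownResult::Exact,
        PushdownResult::Exact,
        PushdownResult::Unsupported,
    ]];
    let up = node.handle_child_pushdown_result(&child_results, &parent_filters);
    // Only the two parent-filter results flow back to the parent.
    assert_eq!(up, vec![PushdownResult::Exact, PushdownResult::Exact]);
}
```

The point of the split is visible in the last few lines of `handle_child_pushdown_result`: the node's own predicate result stays local, so nothing it injected ever leaks upward.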
--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.