adriangb commented on PR #15801: URL: https://github.com/apache/datafusion/pull/15801#issuecomment-2848840198
> However, the existing ExecutionPlan APIs are general-purpose, self-explanatory, and simple, whereas these new ones feel harder to understand and are quite specific to the current rule. I initially tried to add something like `filter(&self) -> Arc<dyn PhysicalExpr>` and `try_filter_pushdown(&self, filter: Arc<dyn PhysicalExpr>) -> Self`, but couldn't get it to pass all the tests. It seems theoretically possible, but I prefer not to spend more time on it at this stage.

I don't think these new APIs are any more specialized than something like `try_swapping_with_projection`, `supports_limit_pushdown`, or `cardinality_effect`. Yes, they're somewhat complex APIs, but unless someone sees a way to simplify them while still addressing all of the use cases, I don't think there's much to be done about that. In other words, the use case may warrant this level of complexity. For example, `HashJoinExec` needs to direct a different set of filters to each child, possibly modifying the projections and other bits, while also pushing down the filters it generated itself and communicating the result of pushdown to its parents. That is just a _complex_ thing to describe, and because of that I don't think it's even theoretically possible to implement via `filter(&self) -> Arc<dyn PhysicalExpr>` + `try_filter_pushdown(&self, filter: Arc<dyn PhysicalExpr>) -> Self`. In my opinion our best bet is to craft good helper structs and functions to operate on the state, which I've attempted to do here but which could certainly use further refinement.

> Related to the first item, I'm also a bit concerned that the two new APIs -- `gather_filters_for_pushdown` and `handle_child_pushdown_result` -- are both doing parts of the pushdown work. The first one analyzes, the second applies changes. This feels a bit confusing, as the responsibilities aren't clearly separated (I was always thinking of the APIs as: 1 -> information about having a filter, 2 -> how you treat these filters). At the least, we should write in detail what each method and parameter means (I couldn't put them into words :/ )

I don't follow what the confusing part is: I think of it the same way you are describing. One method takes in the parent filters and spits out a description of how to push those, plus any additional filters, down to each child. The second method handles the result of pushing down to the children. I do agree that more and better docstrings would help. (The first sketch at the end of this comment illustrates the two-method shape.)

> Another thing I couldn't fully understand is why we need to separate the analysis & pushdown work of parent filters vs. self filters. Could you give an example where keeping them separate is necessary?

I think it helps preserve the invariant that filters should only be pushed down, never up. Consider the case of a TopK node or `FilterExec`:

1. It receives N filters from its parents.
2. It adds its own filter and pushes everything down to its children.
3. When `handle_child_pushdown_result` gets called, it needs to split them back up: it must tell its parent the result of pushing down each of the parent filters, but it should *not* also include the result of, or a copy of, its own filter. That would cause confusion at the least, if not bugs from filters flowing up the execution plan.

Keeping them separate makes it obvious which filters you injected into your children vs. which ones came from your parents. (The second sketch below shows this splitting for a `FilterExec`-like node.)
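To make the two-phase shape concrete, here is a minimal, self-contained toy model in Rust. To be clear, everything in it is a hypothetical stand-in, not the actual DataFusion API: `Expr`, `PushdownResult`, `FilterPushdownPlan`, and the trait signatures are all simplified for illustration; only the method names echo the real `gather_filters_for_pushdown` / `handle_child_pushdown_result`.

```rust
// Toy model of the two-phase pushdown shape. All types here are
// simplified stand-ins: `Expr` plays the role of `Arc<dyn PhysicalExpr>`,
// and the signatures are invented for illustration only.

/// Stand-in for a physical filter expression.
type Expr = String;

/// What a child reports back for each filter pushed into it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PushdownResult {
    Exact,       // child fully applies the filter; parent may drop it
    Unsupported, // child cannot apply it; parent must keep filtering
}

/// Phase 1 output: which filters to route to each child. A multi-child
/// node like `HashJoinExec` can route a different set to each side.
struct FilterPushdownPlan {
    per_child: Vec<Vec<Expr>>,
}

trait PushdownNode {
    /// Phase 1 (analysis): given the filters arriving from the parent,
    /// describe how to route them -- plus any self-generated filters --
    /// to each child. No plan rewriting happens here.
    fn gather_filters_for_pushdown(&self, parent_filters: &[Expr]) -> FilterPushdownPlan;

    /// Phase 2 (application): given what the children reported back,
    /// rewrite self if needed and report, for each *parent* filter,
    /// whether it was handled.
    fn handle_child_pushdown_result(
        &self,
        child_results: &[Vec<PushdownResult>],
        parent_filters: &[Expr],
    ) -> Vec<PushdownResult>;
}
```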
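Continuing that toy model, here is how the parent-vs-self split from the TopK / `FilterExec` example above could look. It reuses the types from the previous sketch and assumes, purely for illustration, that a child reports results in the same order the filters were handed to it: parent filters first, the node's own filter last.

```rust
// A `FilterExec`-like node in the toy model above. The ordering of
// `child_results` (parent filters first, self filter last) is an
// assumption of this sketch, not a documented contract.

struct FilterNode {
    predicate: Expr, // the node's own filter
}

impl PushdownNode for FilterNode {
    fn gather_filters_for_pushdown(&self, parent_filters: &[Expr]) -> FilterPushdownPlan {
        // Push the parent's filters *and* our own predicate to our
        // single child: parent filters first, self filter last.
        let mut filters = parent_filters.to_vec();
        filters.push(self.predicate.clone());
        FilterPushdownPlan { per_child: vec![filters] }
    }

    fn handle_child_pushdown_result(
        &self,
        child_results: &[Vec<PushdownResult>],
        parent_filters: &[Expr],
    ) -> Vec<PushdownResult> {
        let results = &child_results[0];
        // Split the results back up: the first N entries correspond to
        // the N parent filters; the last entry is our own predicate.
        // Only the parent-filter results flow upward -- echoing our own
        // filter to the parent is exactly what would let filters "flow
        // up" the plan.
        let (parent_part, _self_part) = results.split_at(parent_filters.len());
        parent_part.to_vec()
    }
}

fn main() {
    let node = FilterNode { predicate: "c > 5".to_string() };
    let parent_filters = vec!["a = 1".to_string(), "b < 2".to_string()];

    let plan = node.gather_filters_for_pushdown(&parent_filters);
    assert_eq!(plan.per_child[0].len(), 3); // 2 parent filters + own predicate

    // Pretend the child absorbed the parent filters but not our predicate.
    let child_results = vec![vec![
        PushdownResult::Exact,
        PushdownResult::Exact,
        PushdownResult::Unsupported,
    ]];
    let up = node.handle_child_pushdown_result(&child_results, &parent_filters);
    // Only the two parent-filter results flow back to the parent.
    assert_eq!(up, vec![PushdownResult::Exact, PushdownResult::Exact]);
}
```

The point of the split is visible in the last few lines of `handle_child_pushdown_result`: the node's own predicate result stays local, so nothing it injected ever leaks upward.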
--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.