Re: [PR] refactor filter pushdown apis [datafusion]

via GitHub Thu, 01 May 2025 06:07:26 -0700


adriangb commented on code in PR #15801:
URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070247089



##########
datafusion/physical-plan/src/execution_plan.rs:
##########
@@ -471,39 +471,53 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync {
         Ok(None)
     }
 
-    /// Attempts to recursively push given filters from the top of the tree 
into leafs.
-    ///
-    /// This is used for various optimizations, such as:
-    ///
-    /// * Pushing down filters into scans in general to minimize the amount of 
data that needs to be materialzied.
-    /// * Pushing down dynamic filters from operators like TopK and Joins into 
scans.
-    ///
-    /// Generally the further down (closer to leaf nodes) that filters can be 
pushed, the better.
-    ///
-    /// Consider the case of a query such as `SELECT * FROM t WHERE a = 1 AND 
b = 2`.
-    /// With no filter pushdown the scan needs to read and materialize all the 
data from `t` and then filter based on `a` and `b`.
-    /// With filter pushdown into the scan it can first read only `a`, then 
`b` and keep track of
-    /// which rows match the filter.
-    /// Then only for rows that match the filter does it have to materialize 
the rest of the columns.
-    ///
-    /// # Default Implementation
-    ///
-    /// The default implementation assumes:
-    /// * Parent filters can't be passed onto children.
-    /// * This node has no filters to contribute.
-    ///
-    /// # Implementation Notes
-    ///
-    /// Most of the actual logic is implemented as a Physical Optimizer rule.
-    /// See [`PushdownFilter`] for more details.
-    ///
-    /// [`PushdownFilter`]: 
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/filter_pushdown/struct.PushdownFilter.html
-    fn try_pushdown_filters(

Review Comment:
   Agreed I'd prefer a single method as well - but as per 
https://github.com/apache/datafusion/pull/15770#discussion_r2052311257 we 
probably would have had to add a new method to the existing implementation 
anyway. I don't see a way to have a single method without making it absurdly 
convoluted just to avoid 2 methods.
   
   I'll also point out that there are multiple dimensions of filter pushdown 
for each node:
   - does the node allow parent filters to be pushed down to it's children?
   - does the node want to add any filters to be passed to it's children?
   - does the node need to handle the result of pushdown?
   
   Having two methods makes it easy to have canned implementations for each one 
of these independently and combine/compose them.
   
   E.g.:
   - TopK or HashJoin will override `gather_filters_for_pushdown` but not 
`handle_child_pushdown_result`
   - FilterExec will override both
   - DataSourceExec will override only `handle_child_pushdown_result`
   - ProjectionExec will override only `gather_filters_for_pushdown`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] refactor filter pushdown apis [datafusion]

Reply via email to