adriangb commented on code in PR #20117:
URL: https://github.com/apache/datafusion/pull/20117#discussion_r2760298530


##########
datafusion/optimizer/src/push_down_filter.rs:
##########
@@ -1295,6 +1297,18 @@ fn rewrite_projection(
     predicates: Vec<Expr>,
     mut projection: Projection,
 ) -> Result<(Transformed<LogicalPlan>, Option<Expr>)> {
+    // Note: This check coordinates with ExtractLeafExpressions optimizer rule.
+    // See extract_leaf_expressions.rs for details on why these projections 
exist.
+    // Don't push filters through extracted expression projections.
+    // Pushing filters through would rewrite expressions like 
`__datafusion_extracted_1 > 150`
+    // back to `get_field(s,'value') > 150`, undoing the extraction.
+    if is_extracted_expr_projection(&projection) {

Review Comment:
   This is obviously not great, but I don't see another way to avoid this. 
Otherwise if we have:
   
   ```
   Filter: get_field(col, 'foo') > 1
     TableScan: projection=[col]
   ```
   
   And we run our new rule to get:
   
   ```
   Projection: col('col')
     Filter: __datafusion_extracted_1  > 1
       Projection: get_field(col, 'foo') as  __datafusion_extracted_1, col
           TableScan: projection=[col]
   ```
   
   Then this rule runs and will produce:
   
   ```
   Projection: col('col')
       Projection: get_field(col, 'foo') as  __datafusion_extracted_1, col
           Filter: get_field(col, 'foo') > 1
                TableScan projection=[col]
   ```
   
   Because it wants to push the filter under the projection.
   I'd argue as a general rule there's no point in pushing a filter under a 
projection that is purely column selections / get_field expressions especially 
if we can't then push it further.
   Maybe a more robust fix would be to have the filter pushdown optimizer rule 
traverse the rest of the plan, find the position it plans to push into and then 
check if there's any advantage to doing some (i.e. is it pushing the filter 
under an expensive operator that benefits from less input data, or is it just 
doing a trivial pointless pushdown like in the case above). But that would be a 
lot more involved so I chose this simpler solution for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to