holdenk opened a new pull request, #46143:
URL: https://github.com/apache/spark/pull/46143

   ### What changes were proposed in this pull request?
   
   Changes the filter pushdown optimizer so that it does not push filters down 
past projections of the same element when we reasonably expect that computing 
that element is expensive.
   
   This is a more complex alternative to 
https://github.com/apache/spark/pull/45802 which also moves parts of 
projections down so that the filters can move further down.
   
   This introduces an "expectedCost" mechanism, which we may or may not want. 
Previous filter ordering work used filter pushdowns as an approximation of 
expression cost, but here we need more granularity. As an alternative, we could 
introduce a boolean "expensive" flag for operations rather than a numeric cost. 
Another alternative would be to check whether the predicate can be "converted", 
as a proxy for it being cheap.
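   
   As a rough illustration of the difference between a numeric expected cost 
and a boolean flag, here is a minimal Python sketch over a toy expression tree. 
It is not the code in this PR: `Expr`, `ASSUMED_COSTS`, `expected_cost`, and 
`is_expensive_flag` are hypothetical names, and the cost numbers are made up.

```python
from dataclasses import dataclass

# Hypothetical per-operation costs; the numbers are illustrative only.
ASSUMED_COSTS = {"attr": 0, "literal": 0, "gt": 1, "add": 1,
                 "from_json": 100, "udf": 100}
# Hypothetical set of operations treated as expensive by the flag variant.
EXPENSIVE_OPS = {"from_json", "udf"}


@dataclass(frozen=True)
class Expr:
    """Toy expression node: an operation name plus child expressions."""
    op: str
    children: tuple = ()


def expected_cost(expr: Expr) -> int:
    """Numeric variant: sum the assumed cost of this node and its children."""
    own = ASSUMED_COSTS.get(expr.op, 1)
    return own + sum(expected_cost(c) for c in expr.children)


def is_expensive_flag(expr: Expr) -> bool:
    """Flag variant: one expensive node marks the whole tree as expensive."""
    return expr.op in EXPENSIVE_OPS or any(
        is_expensive_flag(c) for c in expr.children)


# (x + 1) > 0 versus from_json(raw) > 0
cheap = Expr("gt", (Expr("add", (Expr("attr"), Expr("literal"))),
                    Expr("literal")))
costly = Expr("gt", (Expr("from_json", (Expr("attr"),)), Expr("literal")))

print(expected_cost(cheap), is_expensive_flag(cheap))
print(expected_cost(costly), is_expensive_flag(costly))
```

   The numeric variant can rank expressions against each other, while the flag 
only separates cheap from expensive; which of the two is enough depends on how 
much granularity the optimizer rule needs.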
   
   ### Future Work / What else remains to do?
   
   Right now, if a condition is expensive and it references something in the 
projection, we don't push it down. We could probably do better and gate this on 
whether the thing we are referencing is expensive rather than the condition 
itself. We could do this as a follow-up item or as part of this PR.
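   
   The two gating rules can be contrasted with a small Python sketch. This is a 
literal reading of the paragraph above over toy nested tuples, not the 
optimizer code: `EXPENSIVE_OPS`, `push_current`, and `push_proposed` are 
hypothetical, and the real rule may compute cost differently (for example, 
after alias substitution).

```python
# Hypothetical set of operations considered expensive.
EXPENSIVE_OPS = {"from_json", "udf"}


def is_expensive(expr):
    """expr is a column name (str), a literal, or a tuple (op, child, ...)."""
    if not isinstance(expr, tuple):
        return False
    op, *children = expr
    return op in EXPENSIVE_OPS or any(is_expensive(c) for c in children)


def references(expr):
    """Collect the column names an expression refers to."""
    if isinstance(expr, str):
        return {expr}
    if not isinstance(expr, tuple):
        return set()  # literal
    _, *children = expr
    return set().union(*(references(c) for c in children))


def push_current(cond, aliases):
    """Block pushdown when the condition is expensive and touches the projection."""
    touches_projection = bool(references(cond) & aliases.keys())
    return not (is_expensive(cond) and touches_projection)


def push_proposed(cond, aliases):
    """Block pushdown only when a referenced projected expression is expensive."""
    referenced = references(cond) & aliases.keys()
    return not any(is_expensive(aliases[name]) for name in referenced)


# Projection: parsed = from_json(raw), total = x + y
aliases = {"parsed": ("from_json", "raw"), "total": ("add", "x", "y")}

expensive_cond_cheap_alias = ("udf", "total")     # udf(total)
cheap_cond_expensive_alias = ("gt", "parsed", 0)  # parsed > 0

print(push_current(expensive_cond_cheap_alias, aliases))
print(push_proposed(expensive_cond_cheap_alias, aliases))
print(push_proposed(cheap_cond_expensive_alias, aliases))
```

   In this toy model, `udf(total)` is held back by the first rule even though 
recomputing `total` is cheap, whereas the second rule lets it through and only 
holds back filters that would recompute `from_json`.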
   
   ### Why are the changes needed?
   
   Currently, Spark may compute expensive operations (such as JSON parsing or 
UDF evaluation) twice as a result of filter pushdown past projections.
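   
   The double evaluation can be simulated in plain Python without Spark. This 
is only a sketch of the effect: `parse_json` stands in for any expensive 
expression, and the two functions are hypothetical hand-written equivalents of 
the plans with and without the pushdown.

```python
import json

# Mutable counter so we can see how often the expensive function runs.
counter = {"calls": 0}


def parse_json(s):
    """Stand-in for an expensive expression; counts its invocations."""
    counter["calls"] += 1
    return json.loads(s)


rows = ['{"a": 1}', '{"a": 5}', '{"a": 9}']


def project_then_filter(rows):
    # The projection computes the parsed value once; the filter reuses it.
    projected = [parse_json(r) for r in rows]
    return [p for p in projected if p["a"] > 3]


def filter_pushed_below_project(rows):
    # The filter's reference to the projected column is replaced by the
    # expression itself, so parse_json runs in the filter and then again
    # in the projection for every row that survives.
    kept = [r for r in rows if parse_json(r)["a"] > 3]
    return [parse_json(r) for r in kept]


counter["calls"] = 0
without_pushdown = project_then_filter(rows)
calls_without_pushdown = counter["calls"]

counter["calls"] = 0
with_pushdown = filter_pushed_below_project(rows)
calls_with_pushdown = counter["calls"]

print(calls_without_pushdown, calls_with_pushdown)
```

   Both variants return the same rows, but the pushed-down variant pays for one 
extra parse per surviving row, which matters when the filter is not very 
selective.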
   
   ### Does this PR introduce _any_ user-facing change?
   
   This SQL optimizer change may affect some user queries; results should be 
the same, and the queries will hopefully be a little faster.
   
   ### How was this patch tested?
   
   New tests were added to the FilterPushDownSuite, and the initial problem of 
double evaluation was confirmed with a GitHub gist.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No

