holdenk opened a new pull request, #53773:
URL: https://github.com/apache/spark/pull/53773

   ### What changes were proposed in this pull request?
   
   This is a follow-on to 
[SPARK-47672](https://issues.apache.org/jira/browse/SPARK-47672). For example, 
if a filter references two different columns that are added in a projection 
using regexes (or arbitrary function calls, etc.), we should split the 
projection into two so that the second regex only needs to be evaluated on the 
smaller, already-filtered data set.
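
To make the intended saving concrete, here is a minimal sketch in plain Python. It is not Catalyst code and not the implementation in this PR. The rows, the patterns, and the names `CountingMatcher`, `single_projection`, and `split_projection` are all hypothetical and exist only for illustration. It counts how often each regex runs when both columns are projected before a single filter, versus when the projection is split around the first filter.

```python
import re

# Hypothetical rows and patterns, made up for illustration only.
ROWS = [
    {"a": "spark-1", "b": "sql-x"},
    {"a": "flink-2", "b": "sql-y"},
    {"a": "spark-3", "b": "core-z"},
    {"a": "spark-4", "b": "sql-w"},
]
PAT_A = re.compile(r"^spark-\d$")
PAT_B = re.compile(r"^sql-\w$")


class CountingMatcher:
    """Wraps a compiled regex and counts how many times it is evaluated."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        return self.pattern.match(value) is not None


def single_projection(rows, match_a, match_b):
    # Shape: Project(a_ok, b_ok) over every row, then Filter(a_ok AND b_ok).
    projected = [dict(r, a_ok=match_a(r["a"]), b_ok=match_b(r["b"])) for r in rows]
    return [r for r in projected if r["a_ok"] and r["b_ok"]]


def split_projection(rows, match_a, match_b):
    # Shape: Project(a_ok) -> Filter(a_ok) -> Project(b_ok) -> Filter(b_ok).
    first = [dict(r, a_ok=match_a(r["a"])) for r in rows]
    survivors = [r for r in first if r["a_ok"]]
    second = [dict(r, b_ok=match_b(r["b"])) for r in survivors]
    return [r for r in second if r["b_ok"]]


a1, b1 = CountingMatcher(PAT_A), CountingMatcher(PAT_B)
a2, b2 = CountingMatcher(PAT_A), CountingMatcher(PAT_B)

unsplit = single_projection(ROWS, a1, b1)
split = split_projection(ROWS, a2, b2)

# Both shapes return the same rows; only the number of evaluations differs.
print(b1.calls, b2.calls)  # prints "4 3"
```

In this toy data set the second regex runs on three rows instead of four, because one row is dropped by the first filter. The saving in a real query would depend on how selective the first predicate is.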
   
    
   
   The logic for doing this gets fairly complex, and it can increase the size 
of the query plan, but it only increases the plan size where doing so would 
likely reduce the amount of data evaluated. A working implementation was 
proposed as part of SPARK-47672, but it was decided that it was too complex to 
include in a regression fix.
   
   
   ### Why are the changes needed?
   
   Reducing the data being evaluated earlier in projection/filters can improve 
performance.
   
   
   ### Does this PR introduce _any_ user-facing change?
   We'd probably add a flag for this and queries would behave differently.
   
   ### How was this patch tested?
   More tests in the filter pushdown suite.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Gen AI was used for some of the test generation, although it got most of it 
backwards to start with and then we had to invert its tests. Honestly, that 
might have taken more time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

