holdenk opened a new pull request, #53773: URL: https://github.com/apache/spark/pull/53773
### What changes were proposed in this pull request?

This is a follow-up to [SPARK-47672](https://issues.apache.org/jira/browse/SPARK-47672). For example, if a filter references two different columns that are added in a projection using regexes (or arbitrary function calls, etc.), we should split the projection in two so that the second regex only needs to be evaluated on the smaller, already-filtered data set. The logic for doing this gets fairly complex and it can increase the size of the query plan, but it only increases the plan size where doing so would likely reduce the amount of data evaluated. A working implementation was proposed as part of SPARK-47672, but it was decided that it was too complex to be part of a regression fix.

### Why are the changes needed?

Reducing the amount of data evaluated earlier in projections and filters can improve performance.

### Does this PR introduce _any_ user-facing change?

We'd probably add a flag for this, and queries would behave differently.

### How was this patch tested?

More tests in the filter pushdown suite.

### Was this patch authored or co-authored using generative AI tooling?

Gen AI was used for some of the test generation, although it got most of it backwards to start with and we then had to invert its tests. Honestly, that might have taken more time.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
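To make the projection split described in the pull request concrete, here is a minimal plain-Python sketch. It is not Spark code and not the implementation in the PR: the rows, the two regexes, and the helper names (`scheme`, `kind`, `single_projection`, `split_projection`) are all hypothetical, chosen only to show why evaluating the second expensive expression after the first filter condition can reduce work.

```python
import re

# Hypothetical input rows; the column names are illustrative only.
ROWS = [
    {"url": "https://a.example/x", "agent": "bot-1"},
    {"url": "http://b.example/y", "agent": "bot-2"},
    {"url": "https://c.example/z", "agent": "browser"},
    {"url": "ftp://d.example/w", "agent": "bot-3"},
]

# Counts how often each "expensive" expression is evaluated.
calls = {"scheme": 0, "kind": 0}


def scheme(row):
    """First projected column: a regex over `url`."""
    calls["scheme"] += 1
    return re.match(r"^(\w+)://", row["url"]).group(1)


def kind(row):
    """Second projected column: a regex over `agent`."""
    calls["kind"] += 1
    return re.match(r"^([a-z]+)", row["agent"]).group(1)


def single_projection(rows):
    """One projection computes both columns, then one filter checks both."""
    projected = [{"scheme": scheme(r), "kind": kind(r)} for r in rows]
    return [p for p in projected if p["scheme"] == "https" and p["kind"] == "bot"]


def split_projection(rows):
    """Project and filter on the first column, then project the second
    column only for the rows that survived the first filter."""
    first = [(r, scheme(r)) for r in rows]
    survivors = [(r, s) for r, s in first if s == "https"]
    second = [{"scheme": s, "kind": kind(r)} for r, s in survivors]
    return [p for p in second if p["kind"] == "bot"]
```

Both functions return the same rows. On the four sample rows, `single_projection` evaluates the second regex four times, while `split_projection` evaluates it only twice, because two rows are dropped by the first condition. The cost is an extra projection and filter step, which mirrors the larger query plan mentioned in the description.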
