abellina opened a new pull request, #36505: URL: https://github.com/apache/spark/pull/36505
### What changes were proposed in this pull request?

This PR seeks to allow EXISTS (and also NOT EXISTS, IN, and NOT IN) to be rewritten as the appropriate join by `RewritePredicateSubquery` before `InferFiltersFromConstraints` runs. This allows filters to be inferred from the join conditions, such as null filtering that can be pushed down to the scan.

This particular example came about when executing a query derived from TPCDS q16, where an EXISTS is converted into a LeftSemi join but no null filtering or pushed-down filters are generated, forcing Spark to load many null-key rows that will never match (more details in the JIRA issue, with plans and a simpler repro case).

I am particularly wary of making changes in the optimizer, as rules depend on their order and build on each other. I copied (but believe I should move) `RewritePredicateSubquery` ahead of `InferFiltersFromConstraints`, which has the desired effect in my particular test case. I did not bring the whole `RewriteSubquery` batch with it, and that's my first question: should all of [`RewriteSubquery`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L237) move further up?

With the new join and filter (and push down) now added earlier, does anyone know if this could cause conflicts with rules that come after? I've spent some time reading through each of the rules, and I am not convinced that a push down rule or join reorder won't be affected. Any feedback would be appreciated.

### Why are the changes needed?

Without this change, LeftSemi/LeftAnti joins inserted because of (NOT) EXISTS or (NOT) IN will not get null key filtering or push down benefits.

### Does this PR introduce _any_ user-facing change?

No, this is going to make some queries run faster.

### How was this patch tested?

- Manual execution of a query based on TPCDS q16
- Unit test was added to assert that the FilterExec is added.

I am happy to add more tests.
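To illustrate why an inferred null filter is safe for the LeftSemi case described in this PR, here is a minimal toy model in plain Python (not Spark code). The names `sql_eq`, `left_semi_join`, and `drop_null_keys` are hypothetical helpers invented for this sketch; they only mimic SQL's rule that an equality comparison involving NULL never evaluates to true. The sketch does not model LeftAnti, whose null handling differs.

```python
from typing import List, Optional, Tuple

# A row is (join_key, payload); None stands in for SQL NULL.
Row = Tuple[Optional[int], str]


def sql_eq(a: Optional[int], b: Optional[int]) -> bool:
    """Toy SQL equality: a comparison with NULL is never true."""
    if a is None or b is None:
        return False
    return a == b


def left_semi_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Keep each left row that has at least one matching right row."""
    return [l for l in left if any(sql_eq(l[0], r[0]) for r in right)]


def drop_null_keys(rows: List[Row]) -> List[Row]:
    """Stand-in for an inferred IsNotNull(key) filter applied before the join."""
    return [row for row in rows if row[0] is not None]


left = [(1, "a"), (None, "b"), (2, "c"), (None, "d"), (3, "e")]
right = [(1, "x"), (None, "y"), (3, "z")]

# Join without any null filtering: null-key rows are read but never match.
unfiltered = left_semi_join(left, right)

# Join after filtering null keys on both sides: fewer rows reach the join.
filtered = left_semi_join(drop_null_keys(left), drop_null_keys(right))

print(unfiltered == filtered)               # same result either way
print(len(left), len(drop_null_keys(left)))  # rows before and after the filter
```

In this toy data, two of the five left rows carry a null key and can be dropped before the join without changing its output, which is the kind of saving the PR aims for when the filter is pushed down to the scan.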
