ahshahid opened a new pull request, #38714: URL: https://github.com/apache/spark/pull/38714
### What changes were proposed in this pull request?

This PR is an improvement. When a subquery references the outer query's aggregate functions, the analyzer in some cases introduces extra aggregate functions that are not needed. The optimizer eventually eliminates them, but in the analysis phase they still add an extra project node and more expressions to process.

The change is in the code that identifies the OuterReference in subquery.scala. Currently, whenever an aggregate expression is found, that expression alone is assumed to be the outer reference. With this change, the code also checks whether the parent expression can be part of the OuterReference.

For example, consider the query

`select cos(sum(a)), b from t1 group by b having exists (select 1 from t2 where x = cos(sum(a)))`

The OuterReference detected would be `cos(sum(a))` instead of just `sum(a)`. As a result, no extra aggregate is added.

### Why are the changes needed?

To avoid adding an unnecessary aggregate to the outer query. This reduces the number of expressions to analyze and clone, and avoids adding an extra project node.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran the precheckin tests and added new tests.
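The idea can be illustrated with a toy model. The sketch below is not Spark code and does not use any Catalyst API: `Expr`, `refs_aggregate_only`, `refs_widest`, and `extra_aggregates` are hypothetical names made up for this illustration, and the rule for widening to the parent (every leaf under it is an outer column or a literal) is an assumption about what "can be part of the OuterReference" means. It only shows why picking `cos(sum(a))` rather than `sum(a)` lets the outer query reuse an expression it already computes.

```python
from dataclasses import dataclass
from typing import List, Tuple

AGGREGATES = {"sum", "count", "min", "max", "avg"}


@dataclass(frozen=True)
class Expr:
    """Toy expression node; hashable so expressions compare by structure."""
    op: str                               # "col", "lit", or a function name
    children: Tuple["Expr", ...] = ()
    outer: bool = False                   # for "col": resolves in the outer query
    name: str = ""                        # for "col"/"lit": display name


def col(name: str, outer: bool = False) -> Expr:
    return Expr("col", (), outer, name)


def fn(op: str, *args: Expr) -> Expr:
    return Expr(op, tuple(args))


def leaves(e: Expr):
    if not e.children:
        yield e
    for c in e.children:
        yield from leaves(c)


def has_aggregate(e: Expr) -> bool:
    return e.op in AGGREGATES or any(has_aggregate(c) for c in e.children)


def only_outer(e: Expr) -> bool:
    # Assumed widening rule: every leaf is an outer column or a literal.
    return all(l.op == "lit" or (l.op == "col" and l.outer) for l in leaves(e))


def refs_aggregate_only(e: Expr) -> List[Expr]:
    """Previous behaviour in this model: stop at the aggregate itself."""
    if e.op in AGGREGATES and only_outer(e):
        return [e]
    return [r for c in e.children for r in refs_aggregate_only(c)]


def refs_widest(e: Expr) -> List[Expr]:
    """Proposed behaviour in this model: take the widest outer-only parent."""
    if has_aggregate(e) and only_outer(e):
        return [e]
    return [r for c in e.children for r in refs_widest(c)]


def extra_aggregates(outer_output: List[Expr], refs: List[Expr]) -> List[Expr]:
    """Outer references the outer query does not already produce."""
    return [r for r in refs if r not in outer_output]


def show(e: Expr) -> str:
    if e.op in ("col", "lit"):
        return e.name
    return f"{e.op}({', '.join(show(c) for c in e.children)})"


# Outer query output: cos(sum(a)), b. Subquery condition: x = cos(sum(a)).
cos_sum_a = fn("cos", fn("sum", col("a", outer=True)))
outer_output = [cos_sum_a, col("b", outer=True)]
cond = fn("eq", col("x"), cos_sum_a)

old_refs = refs_aggregate_only(cond)   # the bare aggregate
new_refs = refs_widest(cond)           # the aggregate together with its parent
```

In this model, `extra_aggregates(outer_output, old_refs)` reports `sum(a)` as something the outer query would have to compute additionally, while `extra_aggregates(outer_output, new_refs)` is empty because `cos(sum(a))` is already in the outer query's output.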
