[GitHub] [spark] ahshahid opened a new pull request, #38714: [WIP][SPARK-41141]. avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it

GitBox Fri, 18 Nov 2022 10:36:53 -0800


ahshahid opened a new pull request, #38714:
URL: https://github.com/apache/spark/pull/38714


   ### What changes were proposed in this pull request?
   This PR is an improvement.
   When a subquery references the outer query's aggregate functions, the analyzer in some cases introduces extra aggregate expressions that are not needed. Although the optimizer eventually eliminates them, in the analysis phase they still add, among other things, an extra Project node.
   The change is in the code in subquery.scala that identifies the OuterReference.
   Currently, whenever an aggregate expression is found, that expression alone is assumed to be the OuterReference.
   With this change, the code also checks whether the parent expression can be part of the OuterReference.
   So if we consider the query

   select cos(sum(a)), b from t1 group by b having exists (select 1 from t2 where x = cos(sum(a)))

   the OuterReference detected would be cos(sum(a)) instead of just sum(a).
   As a result, no extra aggregate would be added.
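
The difference between the two detection strategies can be sketched with a toy expression tree. This is not Spark's Catalyst code: `Expr`, `fully_outer`, `collect_outer_refs` and `missing` are hypothetical names made up for illustration, and the widening rule used here (a function call is widened when all of its arguments are themselves outer-only) is an assumption about the general idea, not the exact check in the PR.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Expr:
    """Hypothetical toy expression node (not a Catalyst class)."""
    name: str
    kind: str  # "column", "aggregate", or "function"
    children: Tuple["Expr", ...] = ()
    outer: bool = False  # for columns: does it belong to the outer query?

    def sql(self) -> str:
        if self.kind == "column":
            return self.name
        return f"{self.name}({', '.join(c.sql() for c in self.children)})"


def is_outer_aggregate(e: Expr) -> bool:
    # An aggregate whose arguments are all outer-query columns.
    return e.kind == "aggregate" and all(
        c.kind == "column" and c.outer for c in e.children
    )


def fully_outer(e: Expr) -> bool:
    # Assumed widening rule: an outer aggregate, or a function call whose
    # arguments are all fully outer (so nothing from the subquery is used).
    if is_outer_aggregate(e):
        return True
    if e.kind == "function" and e.children:
        return all(fully_outer(c) for c in e.children)
    return False


def collect_outer_refs(e: Expr, widen: bool) -> List[Expr]:
    # widen=False: stop at the aggregate itself (the current behaviour).
    # widen=True: prefer the largest enclosing outer-only expression.
    if (widen and fully_outer(e)) or is_outer_aggregate(e):
        return [e]
    refs: List[Expr] = []
    for child in e.children:
        refs.extend(collect_outer_refs(child, widen))
    return refs


def missing(refs: List[Expr], select_list: List[Expr]) -> List[Expr]:
    # Outer references the outer aggregate does not already produce;
    # each one would have to be added as an extra aggregate expression.
    return [r for r in refs if r not in select_list]


# The query from the description, modelled by hand.
a = Expr("a", "column", outer=True)
b = Expr("b", "column", outer=True)
x = Expr("x", "column")  # column of t2, i.e. inside the subquery
sum_a = Expr("sum", "aggregate", (a,))
cos_sum_a = Expr("cos", "function", (sum_a,))
condition = Expr("=", "function", (x, cos_sum_a))
outer_select = [cos_sum_a, b]

old_refs = collect_outer_refs(condition, widen=False)
new_refs = collect_outer_refs(condition, widen=True)

# Prints: ['sum(a)'] ['sum(a)']
print([r.sql() for r in old_refs], [m.sql() for m in missing(old_refs, outer_select)])
# Prints: ['cos(sum(a))'] []
print([r.sql() for r in new_refs], [m.sql() for m in missing(new_refs, outer_select)])
```

In this model the narrow reference sum(a) is not a top-level output of the outer query, so it would have to be added, while the widened reference cos(sum(a)) is already there and nothing extra is needed.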
   
   
   ### Why are the changes needed?
   To avoid adding an unnecessary aggregate to the outer query, which reduces the number of expressions to analyze and clone and also avoids adding an extra Project node.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Ran the precheckin tests and added new tests.


