allisonwang-db opened a new pull request #32054:
URL: https://github.com/apache/spark/pull/32054


   ### What changes were proposed in this pull request?
   This PR adds two additional checks in `CheckAnalysis` for correlated scalar 
subquery in Aggregate. It blocks the cases that Spark do not currently support 
based on the rewrite logic in `RewriteCorrelatedScalarSubquery`:
   
https://github.com/apache/spark/blob/aff6c0febb40d9713895ba00d8c77ba00f04bd16/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala#L618-L624
   
   ### Why are the changes needed?
   It can be confusing to users when their queries pass the check analysis but 
cannot be executed. Also, the error messages are confusing:
   
   #### Case 1: correlated scalar subquery in the grouping expressions but not 
in aggregate expressions 
   
   ```sql
   SELECT SUM(c2) FROM t t1 GROUP BY (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = 
t2.c1)
   ```
   We get this error:
   ```
   java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis  
   ```
   because the correlated scalar subquery is not rewritten properly: 
   ```scala
   == Optimized Logical Plan ==
   Aggregate [scalar-subquery#5 [(c1#6 = c1#6#93)]], [sum(c2#7) AS sum(c2)#11L]
   :  +- Aggregate [c1#6], [sum(c2#7) AS sum(c2)#15L, c1#6 AS c1#6#93]
   :     +- LocalRelation [c1#6, c2#7]
   +- LocalRelation [c1#6, c2#7]
   ``` 
   
   #### Case 2: correlated scalar subquery in the aggregate expressions but not 
in the grouping expressions 
   
   ```sql
   SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1), SUM(c2) FROM t t1 
GROUP BY c1
   ```
   We get this error:
   ```
   java.lang.IllegalStateException: Couldn't find sum(c2)#69L in 
[c1#60,sum(c2#61)#64L]
   ```
   because the transformed correlated scalar subquery output is not present in 
the grouping expression of the Aggregate:
   ```scala
   == Optimized Logical Plan ==
   Aggregate [c1#60], [sum(c2)#69L AS scalarsubquery(c1)#70L, sum(c2#61) AS 
sum(c2)#65L]
   +- Project [c1#60, c2#61, sum(c2)#69L]
      +- Join LeftOuter, (c1#60 = c1#60#95)
         :- LocalRelation [c1#60, c2#61]
         +- Aggregate [c1#60], [sum(c2#61) AS sum(c2)#69L, c1#60 AS c1#60#95]
            +- LocalRelation [c1#60, c2#61]
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   ### How was this patch tested?
   New unit tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to