allisonwang-db commented on a change in pull request #32054:
URL: https://github.com/apache/spark/pull/32054#discussion_r609148331
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -305,6 +305,12 @@ trait CheckAnalysis extends PredicateHelper with
LookupCatalog {
s"nor is it an aggregate function. " +
"Add to group by or wrap in first() (or first_value) if
you don't care " +
"which value you get.")
+ case s: ScalarSubquery
+ if s.children.nonEmpty &&
!groupingExprs.exists(_.semanticEquals(s)) =>
+ failAnalysis(s"Correlated scalar subquery '${s.sql}' is
neither " +
+ s"present in the group by, nor in an aggregate function. Add
it to group by " +
+ s"using ordinal position or wrap it in first() (or
first_value) if you don't " +
+ s"care which value you get.")
Review comment:
This is a good point, but there is actually another bug that is harder
to resolve right now (explained below). We can add an assertion in this PR so
the code will fail during query optimization instead of execution time.
```sql
CREATE VIEW t(c1, c2) AS VALUES (0,1), (0,2);
SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1) FROM t t1 GROUP BY 1
```
This will fail even if the query can pass CheckAnalysis because the subquery
in the grouping expressions and in the aggregate expressions are the same
before the optimization phase.
```scala
Aggregate [scalar-subquery#18 [c1#19]], [scalar-subquery#18 [c1#19] AS
scalarsubquery(c1)#27L]
```
But after the `PullupCorrelatedPredicates` rule, and the subquery outputs
are deduplicated, these two subqueries are no longer semantically equal:
```scala
Aggregate [scalar-subquery#18 [(c1#19 = c1#19#31)]], [scalar-subquery#18
[(c1#19 = c1#19#32)] AS scalarsubquery(c1)#27L]
```
Notice the expression `c1#19 = c1#19#31` differs from `c1#19 = c1#19#32`.
So the optimized plan still has an un-rewritten correlated scalar subquery.
```scala
== Optimized Logical Plan ==
Aggregate [scalar-subquery#18 [(c1#19 = c1#19#31)]], [sum(c2)#26L AS
scalarsubquery(c1)#27L]
: +- Aggregate [c1#19], [sum(c2#20) AS sum(c2)#26L, c1#19 AS c1#19#31]
: +- LocalRelation [c1#19, c2#20]
+- Project [c1#19, sum(c2)#26L]
+- Join LeftOuter, (c1#19 = c1#19#32)
:- LocalRelation [c1#19]
+- Aggregate [c1#19], [sum(c2#20) AS sum(c2)#26L, c1#19 AS c1#19#32]
+- LocalRelation [c1#19, c2#20]
```
Note we cannot just compare the ID of the subquery expression because this
case won't work:
```sql
SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1) FROM t t1
GROUP BY (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]