allisonwang-db commented on a change in pull request #32054:
URL: https://github.com/apache/spark/pull/32054#discussion_r609148331



##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
##########
@@ -305,6 +305,12 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog {
                     s"nor is it an aggregate function. " +
                     "Add to group by or wrap in first() (or first_value) if 
you don't care " +
                     "which value you get.")
+              case s: ScalarSubquery
+                  if s.children.nonEmpty && 
!groupingExprs.exists(_.semanticEquals(s)) =>
+                failAnalysis(s"Correlated scalar subquery '${s.sql}' is 
neither " +
+                  s"present in the group by, nor in an aggregate function. Add 
it to group by " +
+                  s"using ordinal position or wrap it in first() (or 
first_value) if you don't " +
+                  s"care which value you get.")

Review comment:
       This is a good point, but there is actually another bug that is harder 
to resolve right now (explained below). We can add an assertion in this PR so 
the code will fail during query optimization instead of execution time.
   ```sql
   CREATE VIEW t(c1, c2) AS VALUES (0,1), (0,2);
   SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1) FROM t t1 GROUP BY 1
   ```
   This will fail even if the query can pass CheckAnalysis because the subquery 
in the grouping expressions and in the aggregate expressions are the same 
before the optimization phase.
   ```scala
   Aggregate [scalar-subquery#18 [c1#19]], [scalar-subquery#18 [c1#19] AS 
scalarsubquery(c1)#27L]
   ```
    But after the `PullupCorrelatedPredicates` rule, and the subquery outputs 
are deduplicated, these two subqueries are no longer semantically equal:
   ```scala
   Aggregate [scalar-subquery#18 [(c1#19 = c1#19#31)]], [scalar-subquery#18 
[(c1#19 = c1#19#32)] AS scalarsubquery(c1)#27L]
   ``` 
   Notice the expression `c1#19 = c1#19#31` differs from `c1#19 = c1#19#32`.
   So the optimized plan still has an un-rewritten correlated scalar subquery.
   ```scala
   == Optimized Logical Plan ==
   Aggregate [scalar-subquery#18 [(c1#19 = c1#19#31)]], [sum(c2)#26L AS 
scalarsubquery(c1)#27L]
   :  +- Aggregate [c1#19], [sum(c2#20) AS sum(c2)#26L, c1#19 AS c1#19#31]
   :     +- LocalRelation [c1#19, c2#20]
   +- Project [c1#19, sum(c2)#26L]
      +- Join LeftOuter, (c1#19 = c1#19#32)
         :- LocalRelation [c1#19]
         +- Aggregate [c1#19], [sum(c2#20) AS sum(c2)#26L, c1#19 AS c1#19#32]
            +- LocalRelation [c1#19, c2#20]
   ```
   Note we cannot just compare the ID of the subquery expression because this 
case won't work:
   ```sql
   SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1) FROM t t1
   GROUP BY (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to