maropu commented on a change in pull request #31913:
URL: https://github.com/apache/spark/pull/31913#discussion_r598368983



##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -870,8 +870,19 @@ object CollapseProject extends Rule[LogicalPlan] with 
AliasHelper {
       if (haveCommonNonDeterministicOutput(p.projectList, 
agg.aggregateExpressions)) {
         p
       } else {
-        agg.copy(aggregateExpressions = buildCleanedProjectList(
-          p.projectList, agg.aggregateExpressions))
+        val complexGroupingExpressions =
+          ExpressionSet(agg.groupingExpressions.filter(_.children.nonEmpty))
+
+        def wrapGroupingExpression(e: Expression): Expression = e match {
+          case _: AggregateExpression => e
+          case _ if complexGroupingExpressions.contains(e) => 
GroupingExpression(e)
+          case _ => e.mapChildren(wrapGroupingExpression)
+        }
+
+        val wrappedAggregateExpressions =
+          
agg.aggregateExpressions.map(wrapGroupingExpression(_).asInstanceOf[NamedExpression])
+        agg.copy(aggregateExpressions =
+          buildCleanedProjectList(p.projectList, wrappedAggregateExpressions))

Review comment:
       I think this issue is not related to `CollapseProject`. For example, we 
can reproduce it like this;
   ```
   $ ./bin/spark-shell --conf 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.CollapseProject
   
   Seq(Some(1), None).toDF("id").createOrReplaceTempView("t")
   val df = sql("""
     SELECT NOT(t.id IS NULL) AS X, count(*) AS c
     FROM t
     GROUP BY t.id IS NULL
   """)
   
   df.show()
   ```
   
   The query fails because `BooleanSimplification` applys illegal expr 
transformation to break the group-by constraint (that is, `group-by columns 
must appear in aggregate exprs`). In the ohter DBMS-like systems (e.g., 
PostgreSQL), the transformed query fails like this;
   ```
   postgres=# SELECT t.id IS NOT NULL AS X, count(*) AS c FROM t GROUP BY t.id 
IS NULL;
   ERROR:  column "t.id" must appear in the GROUP BY clause or be used in an 
aggregate function
   LINE 1: SELECT t.id IS NOT NULL AS X, count(*) AS c FROM t GROUP BY ...
   ```
   I'm currently not sure that this issue can happen only in 
`BooleanSimplification` though, I think we need a general solution to fix this 
kind of the illegal transformation instead of band-aid fixes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to