[GitHub] [spark] peter-toth commented on a change in pull request #31913: [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

GitBox Mon, 05 Apr 2021 10:03:55 -0700


peter-toth commented on a change in pull request #31913:
URL: https://github.com/apache/spark/pull/31913#discussion_r607202293




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -622,31 +624,47 @@ case class Range(
 /**
  * This is a Group by operator with the aggregate functions and projections.
  *
- * @param groupingExpressions expressions for grouping keys
- * @param aggregateExpressions expressions for a project list, which could 
contain
- *                             [[AggregateExpression]]s.
+ * @param groupingExpressions Expressions for grouping keys.
+ * @param aggrExprWithGroupingRefs Expressions for a project list, which could 
contain
+ *                                 [[AggregateExpression]]s and 
[[GroupingExprRef]]s.
+ * @param child The child of the aggregate node.
+ * @param enforceGroupingReferences If [[aggrExprWithGroupingRefs]] should 
contain
+ *                                  [[GroupingExprRef]]s.
+ *
+ * [[aggrExprWithGroupingRefs]] without aggregate functions can contain 
[[GroupingExprRef]]
+ * expressions to refer to complex grouping expressions in 
[[groupingExpressions]]. These references
+ * ensure that optimization rules don't change the aggregate expressions to 
invalid ones that no
+ * longer refer to any grouping expressions and also simplify the expression 
transformations on the
+ * node (need to transform the expression only once).
  *
- * Note: Currently, aggregateExpressions is the project list of this Group by 
operator. Before
- * separating projection from grouping and aggregate, we should avoid 
expression-level optimization
- * on aggregateExpressions, which could reference an expression in 
groupingExpressions.
- * For example, see the rule 
[[org.apache.spark.sql.catalyst.optimizer.SimplifyExtractValueOps]]
+ * For example, in the following query Spark shouldn't optimize the aggregate 
expression
+ * `Not(IsNull(c))` to `IsNotNull(c)` as the grouping expression is 
`IsNull(c)`:
+ * SELECT not(c IS NULL)
+ * FROM t
+ * GROUP BY c IS NULL
+ * Instead, the aggregate expression should contain `Not(GroupingExprRef(0))`.
  */
-case class Aggregate(
+case class AggregateWithGroupingReferences(

Review comment:
       During the optimization phase some of the rules need the "expanded" 
version of aggregate expressions (references are restored to original 
expressions). This is why I wanted to keep the meaning of 
`Aggregate.aggregateExpressions` and the second argument of the 
`Aggregate.unapply()` extractor. Also when an optimizer rule creates an 
aggregate using `Aggregete.apply()` then it should automatically insert 
references the grouping expressions.
   I think it would be error prone if the optimizer rules need to keep in mind 
that they need to convert/restore these references and it is probably better to 
convert/restore references transparently. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] peter-toth commented on a change in pull request #31913: [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

Reply via email to