[GitHub] [spark] cloud-fan commented on a change in pull request #26656: [SPARK-27986][SQL] Support ANSI SQL filter clause for aggregate expression

GitBox Wed, 04 Dec 2019 04:32:17 -0800

URL: https://github.com/apache/spark/pull/26656#discussion_r353717031


 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
 ##########
 @@ -135,19 +135,25 @@ object AggUtils {
     }
     val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
     val groupingAttributes = groupingExpressions.map(_.toAttribute)
+    val filterWithDistinctAttributes = functionsWithDistinct.flatMap(_.filterAttributes.toSeq)
 
     // 1. Create an Aggregate Operator for partial aggregations.
     val partialAggregate: SparkPlan = {
       val aggregateExpressions = functionsWithoutDistinct.map(_.copy(mode = Partial))
       val aggregateAttributes = aggregateExpressions.map(_.resultAttribute)
       // We will group by the original grouping expression, plus an additional expression for the
-      // DISTINCT column. For example, for AVG(DISTINCT value) GROUP BY key, the grouping
-      // expressions will be [key, value].
+      // DISTINCT column and the referred attributes in the FILTER clause associated with each
+      // aggregate function. For example:
+      // 1.for the AVG (DISTINCT value) GROUP BY key, the grouping expression will be [key, value];
+      // 2.for AVG (DISTINCT value) Filter (WHERE value2> 20) GROUP BY key, the grouping expression
+      // will be [key, value, value2].
 
 Review comment:
   I read the source code, and the workflow seems to be:
   1. local aggregate. **input**: records. **group by**: keys and the distinct column. **output**: agg buffers
   2. shuffle aggregate. **input**: agg buffers. **group by**: keys and the distinct column. **output**: agg buffers
   3. local aggregate. **input**: agg buffers. **group by**: keys. **output**: agg buffers
   4. shuffle aggregate. **input**: agg buffers. **group by**: keys. **output**: agg buffers
   
   So I think we only need to evaluate the filter in the first aggregate.
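
   To illustrate the point, here is a rough model of the four stages in plain Scala collections for `AVG(DISTINCT value) FILTER (WHERE value2 > 20) GROUP BY key`. It is only a sketch, not the planner code: `Record`, `stage1`, `stage2`, `stage3And4` and `run` are hypothetical names, stages 3 and 4 are collapsed into one step, and groups whose rows are all filtered out are not modeled. It shows that once the predicate is applied on the raw records in stage 1, `value2` does not have to be carried as a grouping column.

```scala
// Plain-Scala model of the workflow above; none of these names are Spark APIs.
object FilterDistinctSketch {
  final case class Record(key: String, value: Int, value2: Int)

  // Stage 1 (local): the only stage that sees raw records, so the FILTER
  // predicate is evaluated here. Grouping is by (key, value) only.
  def stage1(partition: Seq[Record]): Set[(String, Int)] =
    partition.filter(_.value2 > 20).map(r => (r.key, r.value)).toSet

  // Stage 2 (shuffle): merge per-partition groups, still keyed by (key, value).
  def stage2(partials: Seq[Set[(String, Int)]]): Set[(String, Int)] =
    partials.foldLeft(Set.empty[(String, Int)])(_ ++ _)

  // Stages 3 and 4 (local, then shuffle), collapsed: group by key only and
  // average the already-distinct values.
  def stage3And4(distinctPairs: Set[(String, Int)]): Map[String, Double] =
    distinctPairs.groupBy(_._1).map { case (k, pairs) =>
      val values = pairs.toSeq.map(_._2)
      k -> values.sum.toDouble / values.size
    }

  // Each inner Seq stands for one input partition.
  def run(partitions: Seq[Seq[Record]]): Map[String, Double] =
    stage3And4(stage2(partitions.map(stage1)))
}
```

   In this model the later stages never reference `value2`, which is why evaluating the filter once in the first aggregate looks sufficient.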
