[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

GitBox Sun, 21 Jun 2020 16:51:49 -0700


viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443269072




##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column 
expressions.
     val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do 
it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       Generally I agree. However, we need to use `distinctExpressions` for the 
`normalizedNamedDistinctExpressions`. 
   
   If we want to put this `NormalizeFloatingNumbers` invocation in 
`SparkStrategies`, we need to move `distinctExpressions` from 
`AggUtils.planAggregateWithOneDistinct` to `SparkStrategies` too. 
   
   It would look like, in `SparkStrategies`:
   
   ```scala
   if (functionsWithDistinct.isEmpty) {
     AggUtils.planAggregateWithoutDistinct(
       normalizedGroupingExpressions,
       aggregateExpressions,
       resultExpressions,
       planLater(child))
   } else {
     // functionsWithDistinct is guaranteed to be non-empty. Even though it may 
contain more than one
     // DISTINCT aggregate function, all of those functions will have the same 
column expressions.
     // For example, it would be valid for functionsWithDistinct to be
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), 
COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column 
expressions.
     val distinctExpressions = 
functionsWithDistinct.head.aggregateFunction.children
     val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
         // Ideally this should be done in `NormalizeFloatingNumbers`, but we 
do it here because
         // `groupingExpressions` is not extracted during logical phase.
         NormalizeFloatingNumbers.normalize(e) match {
           case ne: NamedExpression => ne
           case other => Alias(other, other.toString)()
         }
       }
     AggUtils.planAggregateWithOneDistinct(
       normalizedGroupingExpressions,
       functionsWithDistinct,
       functionsWithoutDistinct,
       distinctExpressions,
       normalizedNamedDistinctExpressions,
       resultExpressions,
       planLater(child))
   }
   ```
   
   This leaks more details from `AggUtil` in `SparkStrategies`. Looks not 
pretty good.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Reply via email to