dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443291919
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()
Review comment:
If we broaden the scope, `SparkStrategies` is already looking at the details of `functionsWithDistinct`, like the following.
```scala
val (functionsWithDistinct, functionsWithoutDistinct) =
  aggregateExpressions.partition(_.isDistinct)
if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) {
  // This is a sanity check. We should not reach here when we have multiple distinct
  // column sets. Our `RewriteDistinctAggregates` should take care this case.
  sys.error("You hit a query analyzer bug. Please report your query to " +
    "Spark user mailing list.")
}
```
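To make concrete what that check enforces, here is a minimal, self-contained sketch; `AggExpr` and `checkSingleDistinctColumnSet` are hypothetical simplified stand-ins, not Spark's real expression classes.
```scala
// Hypothetical, simplified model of the sanity check above; `AggExpr` only carries
// the DISTINCT flag and the aggregated columns.
case class AggExpr(name: String, isDistinct: Boolean, children: Seq[String])

def checkSingleDistinctColumnSet(aggregateExpressions: Seq[AggExpr]): Unit = {
  val (functionsWithDistinct, _) = aggregateExpressions.partition(_.isDistinct)
  // Mirrors the check in `SparkStrategies`: all DISTINCT aggregates must share one column set.
  if (functionsWithDistinct.map(_.children.toSet).distinct.length > 1) {
    sys.error("You hit a query analyzer bug. Please report your query to Spark user mailing list.")
  }
}

// OK: both DISTINCT aggregates use the same column set {foo}.
checkSingleDistinctColumnSet(Seq(
  AggExpr("COUNT", isDistinct = true, Seq("foo")),
  AggExpr("MAX", isDistinct = true, Seq("foo"))))

// Would fail the check: {bar} and {foo} are different column sets, which
// `RewriteDistinctAggregates` is expected to have eliminated earlier.
// checkSingleDistinctColumnSet(Seq(
//   AggExpr("COUNT", isDistinct = true, Seq("bar")),
//   AggExpr("COUNT", isDistinct = true, Seq("foo"))))
```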
And the very next lines are the same logic block for `groupingExpressions`.
```scala
// Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
// `groupingExpressions` is not extracted during logical phase.
val normalizedGroupingExpressions = groupingExpressions.map { e =>
  NormalizeFloatingNumbers.normalize(e) match {
    case n: NamedExpression => n
    case other => Alias(other, e.name)(exprId = e.exprId)
  }
}
```
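For context on why that normalization matters at all, here is a tiny plain-Scala sketch (not Spark internals) of the floating-point issue `NormalizeFloatingNumbers` addresses for grouping keys: `0.0` and `-0.0` compare equal but have different bit patterns, so hash-based grouping keys differ unless they are normalized first.
```scala
// 0.0 and -0.0 are equal as doubles, but their raw bit patterns differ,
// so a binary/hash-based grouping key would put them into different groups.
val a = 0.0d
val b = -0.0d
println(a == b)                                       // true
println(java.lang.Double.doubleToRawLongBits(a) ==
        java.lang.Double.doubleToRawLongBits(b))      // false

// A normalization step analogous in spirit to `NormalizeFloatingNumbers.normalize`:
// map -0.0 to 0.0 so equal values yield identical grouping keys. (The real rule also
// handles NaN and nested types; this is only an illustration.)
def normalizeZero(d: Double): Double = if (d == 0.0d) 0.0d else d

println(java.lang.Double.doubleToRawLongBits(normalizeZero(a)) ==
        java.lang.Double.doubleToRawLongBits(normalizeZero(b)))   // true
```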
Given the above, I guess your concern is about only one line of source code,
`val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`.
And the following comment is about the definition of `functionsWithDistinct`, which is generated by `SparkStrategies`. So it's not a leaked detail or something hidden from `SparkStrategies`; to me, it looks like an assumption given by `SparkStrategies`.
```
// functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one
// DISTINCT aggregate function, all of those functions will have the same column expressions.
// For example, it would be valid for functionsWithDistinct to be
// [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
// disallowed because those two distinct aggregates have different column expressions.
```
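As a quick illustration of the two cases that comment describes, here is a hedged sketch using a local session and a made-up table `t` with columns `foo` and `bar` (names are illustrative only).
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("distinct-column-sets").getOrCreate()
import spark.implicits._

Seq((1, 10), (1, 20), (2, 10)).toDF("foo", "bar").createOrReplaceTempView("t")

// Valid for this code path: both DISTINCT aggregates share the column set {foo},
// so functionsWithDistinct satisfies the assumption documented above.
spark.sql("SELECT COUNT(DISTINCT foo), MAX(DISTINCT foo) FROM t").show()

// Still a valid query, but the column sets {bar} and {foo} differ, so
// `RewriteDistinctAggregates` rewrites the plan earlier and this AggUtils path
// should never see two different distinct column sets.
spark.sql("SELECT COUNT(DISTINCT bar), COUNT(DISTINCT foo) FROM t").show()
```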