Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/20174#discussion_r160612592
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
---
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends
Rule[LogicalPlan] {
Alias(new First(attr).toAggregateExpression(),
attr.name)(attr.exprId)
}
}
-    Aggregate(keys, aggCols, child)
+    // SPARK-22951: the implementation of aggregate operator treats the cases with and without
+    // grouping keys differently, when there are no input rows. For the aggregation after
+    // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the
+    // aggregate operator can work correctly (returning an empty iterator).
--- End diff --
> SPARK-22951: Physical aggregate operators distinguish global
> aggregation and grouping aggregations by checking the number of grouping keys.
> The key difference here is that a global aggregation always returns at least
> one row even if there are no input rows. Here we append a literal when the
> grouping key list is empty so that the resulting aggregate operator is properly
> treated as a grouping aggregation.
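The global-vs-grouping distinction described above is standard SQL semantics, not Spark-specific. A minimal sketch using SQLite (purely illustrative, not the Spark code path) shows why an aggregate with an empty grouping-key list would return a spurious row on an empty input, while a grouping aggregation correctly returns none:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (k INTEGER, v INTEGER)")  # table left empty

# Global aggregation (no GROUP BY): always produces exactly one row,
# even though the input has no rows.
global_rows = cur.execute("SELECT COUNT(v) FROM t").fetchall()

# Grouping aggregation (with GROUP BY): produces no rows for empty input.
grouped_rows = cur.execute("SELECT k, COUNT(v) FROM t GROUP BY k").fetchall()

print(len(global_rows))   # 1
print(len(grouped_rows))  # 0
```

Appending a literal grouping key, as the review suggests, moves the empty-keys case from the first behavior to the second without changing the deduplication result.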
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]