Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/20174#discussion_r160612592
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
---
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends
Rule[LogicalPlan] {
Alias(new First(attr).toAggregateExpression(),
attr.name)(attr.exprId)
}
}
-    Aggregate(keys, aggCols, child)
+    // SPARK-22951: the implementation of aggregate operator treats the cases with and without
+    // grouping keys differently, when there are no input rows. For the aggregation after
+    // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the
+    // aggregate operator can work correctly (returning an empty iterator).
--- End diff --
> SPARK-22951: Physical aggregate operators distinguish global
> aggregation and grouping aggregations by checking the number of grouping keys.
> The key difference here is that a global aggregation always returns at least
> one row even if there are no input rows. Here we append a literal when the
> grouping key list is empty so that the resulting aggregate operator is properly
> treated as a grouping aggregation.
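The global-vs-grouping distinction described above is standard SQL semantics, not Spark-specific. A minimal sketch using SQLite (purely illustrative, not the Spark code path) shows why an aggregate with an empty grouping-key list would return a spurious row on an empty input, while a grouping aggregation correctly returns none:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (k INTEGER, v INTEGER)")  # table left empty

# Global aggregation (no GROUP BY): always produces exactly one row,
# even though the input has no rows.
global_rows = cur.execute("SELECT COUNT(v) FROM t").fetchall()

# Grouping aggregation (with GROUP BY): produces no rows for empty input.
grouped_rows = cur.execute("SELECT k, COUNT(v) FROM t GROUP BY k").fetchall()

print(len(global_rows))   # 1
print(len(grouped_rows))  # 0
```

Appending a literal grouping key, as the review suggests, moves the empty-keys case from the first behavior to the second without changing the deduplication result.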
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]