GitHub user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/20174#discussion_r160770992
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -666,4 +665,16 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
assert(exchangePlans.length == 1)
}
}
+
+  Seq(true, false).foreach { codegen =>
+    test("SPARK-22951: dropDuplicates on empty data frames should produce correct aggregate" +
+      s" results when codegen enabled: $codegen") {
+      withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegen.toString)) {
+        assert(Seq.empty[Int].toDF("a").count() == 0)
+        assert(Seq.empty[Int].toDF("a").agg(count("*")).count() == 1)
+        assert(spark.emptyDataFrame.dropDuplicates().count() == 0)
+        assert(spark.emptyDataFrame.dropDuplicates().agg(count("*")).count() == 1)
--- End diff --
@liufengdb Maybe also add assertions to confirm that explicit global
aggregations (those with zero grouping keys) still return one row on an
empty DataFrame? For example:
```scala
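// An empty map means no aggregate expressions. Both forms below are global
// aggregations (no grouping keys), so each should yield a single empty row.
val emptyAgg = Map.empty[String, String]
checkAnswer(
  spark.emptyDataFrame.agg(emptyAgg),
  Seq(Row())
)
checkAnswer(
  spark.emptyDataFrame.groupBy().agg(emptyAgg),
  Seq(Row())
)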
```
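To spell out the semantics these assertions rely on, here is a minimal sketch in plain Scala. It is a toy model and not Spark code: `EmptyAggregateSketch`, `globalCount`, `groupedCount`, and `dropDuplicates` are hypothetical names introduced only for illustration, and it assumes `dropDuplicates` behaves like grouping on the whole row.

```scala
// Toy model, not Spark code: rows are plain Seqs and the two aggregate
// shapes are modelled with ordinary collection operations.
object EmptyAggregateSketch {
  type Row = Seq[Any]

  // Global aggregate (zero grouping keys): the whole input, even an empty
  // one, forms a single group, so there is always exactly one output row.
  def globalCount(rows: Seq[Row]): Seq[Row] =
    Seq(Seq(rows.size.toLong))

  // Grouped aggregate: one output row per distinct key, so an empty input
  // has no groups and produces no rows.
  def groupedCount(rows: Seq[Row], key: Row => Any): Seq[Row] =
    rows.groupBy(key).toSeq.map { case (k, group) => Seq(k, group.size.toLong) }

  // dropDuplicates modelled as keeping one row per distinct whole row
  // (an assumption of this sketch).
  def dropDuplicates(rows: Seq[Row]): Seq[Row] = rows.distinct

  def main(args: Array[String]): Unit = {
    val empty = Seq.empty[Row]
    assert(globalCount(empty) == Seq(Seq(0L)))           // one row holding count 0
    assert(groupedCount(empty, row => row).isEmpty)      // no groups, no rows
    assert(dropDuplicates(empty).isEmpty)                // still empty
    assert(globalCount(dropDuplicates(empty)).size == 1) // global aggregate on top: one row
  }
}
```

Under this model, the bug in SPARK-22951 corresponds to the two shapes being confused on empty input, which is why it seems worth asserting both of them.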
---