Github user aray commented on a diff in the pull request:
https://github.com/apache/spark/pull/17226#discussion_r105322758
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---
@@ -216,4 +216,10 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext{
Row("d", 15000.0, 48000.0) :: Row("J", 20000.0, 30000.0) :: Nil
)
}
+
+ test("pivot with null should not throw NPE") {
+ checkAnswer(
+ Seq(Tuple1(None), Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count(),
+ Row(null, 1, null) :: Row(1, null, 1) :: Nil)
--- End diff --
Right, the non-optimized codepath should have been doing a null-safe equals
in the if statement. I have fixed that in a81c062 and added a unit test.
As to whether an aggregate function of count(1) in a pivot should fill in 0s
for null, I think that is an orthogonal issue. First, note that it will
always* follow the optimized codepath, as the choice is based on the return type
of the aggregate. Second, it's not clear that filling in 0 is the expected
result: for instance, pandas leaves those values as null and Oracle 11g gives 0
(I still need to check R/reshape2 and MS SQL Server). I think it would be best
to open another JIRA ticket to discuss this further.
* unless there are multiple aggregates and one of them is not supported,
which is a consistency problem.
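
To illustrate why the null-safe comparison matters for the if statement mentioned above, here is a minimal sketch in plain Scala. It is not Spark's implementation: `sqlEquals`, `nullSafeEquals` and `pivotCell` are hypothetical helpers that model SQL NULL as `None`. In Spark itself the corresponding Column operators are `===` and `<=>` (`eqNullSafe`).

```scala
// Hypothetical model of SQL equality on nullable Int values (NULL is None).
// This is an illustration only, not the code changed in the pull request.
object NullSafeEqualsSketch {
  // SQL `=`: if either side is NULL, the result is NULL (None here).
  def sqlEquals(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: NULL <=> NULL is true, NULL <=> value is false.
  def nullSafeEquals(a: Option[Int], b: Option[Int]): Boolean = a == b

  // Models `if (pivotColumn == pivotValue) aggInput else null`.
  // A NULL predicate is treated as not satisfied, so the else branch is taken.
  def pivotCell(predicate: Option[Boolean], aggInput: Int): Option[Int] =
    if (predicate.getOrElse(false)) Some(aggInput) else None

  def main(args: Array[String]): Unit = {
    val nul: Option[Int] = None
    // Plain equality: the NULL pivot value never matches, so the cell stays empty.
    println(pivotCell(sqlEquals(nul, nul), 1))
    // Null-safe equality: the NULL pivot value matches the NULL row.
    println(pivotCell(Some(nullSafeEquals(nul, nul)), 1))
  }
}
```

Under this model, only the null-safe variant lets the row whose pivot column is null contribute to the null pivot value, which is the case the new unit test exercises.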