spark git commit: [SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double

dbtsai Mon, 10 Apr 2017 17:15:52 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-2.0 735e2039a -> aec37529f



[SPARK-20270][SQL] na.fill should not change the values in long or integer when 
the default value is in double

## What changes were proposed in this pull request?

This bug was partially addressed in SPARK-18555 
https://github.com/apache/spark/pull/15994, but the root cause isn't completely 
solved. This bug is pretty critical since it changes the member id in Long in 
our application if the member id can not be represented by Double losslessly 
when the member id is very big.

Here is an example how this happens, with
```
      Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), 
(9123146099426677101L, null),
        (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
```
the logical plan will be
```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) 
AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS 
b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
   +- LocalRelation [_1#229L, _2#230]
```

Note that even the value is not null, Spark will cast the Long into Double 
first. Then if it's not null, Spark will cast it back to Long which results in 
losing precision.

The behavior should be that the original value should not be changed if it's 
not null, but Spark will change the value which is wrong.

With the PR, the logical plan will be
```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, 
cast(null as double)), cast(0.2 as double)) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
   +- LocalRelation [_1#229L, _2#230]
```
which behaves correctly without changing the original Long values and also 
avoids extra cost of unnecessary casting.

## How was this patch tested?

unit test added.

+cc srowen rxin cloud-fan gatorsmile

Thanks.

Author: DB Tsai <d...@netflix.com>

Closes #17577 from dbtsai/fixnafill.

(cherry picked from commit 1a0bc41659eef317dcac18df35c26857216a4314)
Signed-off-by: DB Tsai <dbt...@dbtsai.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aec37529
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aec37529
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aec37529

Branch: refs/heads/branch-2.0
Commit: aec37529f0213855f72db569a28aa6f5c37b368e
Parents: 735e203
Author: DB Tsai <d...@netflix.com>
Authored: Mon Apr 10 05:16:34 2017 +0000
Committer: DB Tsai <dbt...@dbtsai.com>
Committed: Tue Apr 11 00:14:30 2017 +0000

----------------------------------------------------------------------
 .../org/apache/spark/sql/DataFrameNaFunctions.scala   |  5 +++--
 .../apache/spark/sql/DataFrameNaFunctionsSuite.scala  | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/aec37529/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
----------------------------------------------------------------------
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
index 3df8bb9..d97fd72 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
@@ -408,10 +408,11 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
     val quotedColName = "`" + col.name + "`"
     val colValue = col.dataType match {
       case DoubleType | FloatType =>
-        nanvl(df.col(quotedColName), lit(null)) // nanvl only supports these 
types
+        // nanvl only supports these types
+        nanvl(df.col(quotedColName), lit(null).cast(col.dataType))
       case _ => df.col(quotedColName)
     }
-    coalesce(colValue, lit(replacement)).cast(col.dataType).as(col.name)
+    coalesce(colValue, lit(replacement).cast(col.dataType)).as(col.name)
   }
 
   /**

http://git-wip-us.apache.org/repos/asf/spark/blob/aec37529/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala
----------------------------------------------------------------------
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala
index fd82984..aa237d0 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala
@@ -146,6 +146,20 @@ class DataFrameNaFunctionsSuite extends QueryTest with 
SharedSQLContext {
     )
 
     checkAnswer(
+      Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), 
(9123146099426677101L, null),
+        (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2),
+      Row(0, 3.14) :: Row(9123146099426677101L, 0.2) :: 
Row(9123146560113991650L, 1.6)
+        :: Row(0, 0.2) :: Nil
+    )
+
+    checkAnswer(
+      Seq[(java.lang.Long, java.lang.Float)]((null, 3.14f), 
(9123146099426677101L, null),
+        (9123146560113991650L, 1.6f), (null, null)).toDF("a", 
"b").na.fill(0.2),
+      Row(0, 3.14f) :: Row(9123146099426677101L, 0.2f) :: 
Row(9123146560113991650L, 1.6f)
+        :: Row(0, 0.2f) :: Nil
+    )
+
+    checkAnswer(
       Seq[(java.lang.Long, java.lang.Double)]((null, 1.23), (3L, null), (4L, 
3.45))
         .toDF("a", "b").na.fill(2.34),
       Row(2, 1.23) :: Row(3, 2.34) :: Row(4, 3.45) :: Nil


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

spark git commit: [SPARK-20270][SQL] na.fill should not change the values in long or integer when the default value is in double

Reply via email to