GitHub user dbtsai opened a pull request:
https://github.com/apache/spark/pull/17577
[SPARK-20270][SQL] na.fill should not change the values in long or integer
when the default value is in double
## What changes were proposed in this pull request?
This bug was partially addressed in SPARK-18555
https://github.com/apache/spark/pull/15994, but the root cause isn't completely
solved. This bug is pretty critical since it changes the member id in Long in
our application if the member id can not be represented by Double losslessly
when the member id is very big.
Here is an example how this happens, with
```
Seq[(java.lang.Long, java.lang.Double)]((null, 3.14),
(9123146099426677101L, null),
(9123146560113991650L, 1.6), (null, null)).toDF("a",
"b").na.fill(0.2),
```
the logical plan will be
```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as
bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as
double) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
+- LocalRelation [_1#229L, _2#230]
```
Note that even the value is not null, Spark will cast the Long into Double
first. Then if it's not null, Spark will cast it back to Long which results in
losing precision.
The behavior should be that the original value should not be changed if
it's not null, but Spark will change the value which is wrong.
With the PR, the logical plan will be
```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L,
coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
+- LocalRelation [_1#229L, _2#230]
```
which behaves correctly without changing the original Long values and also
avoids extra cost of unnecessary casting.
## How was this patch tested?
unit test added.
+cc @srowen @rxin @cloud-fan @gatorsmile
Thanks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark fixnafill
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17577.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17577
----
commit cb5440787d676d2c74983ddcd1df31b38d009d71
Author: DB Tsai <[email protected]>
Date: 2017-04-09T07:57:45Z
na.fill will change the values in long or integer when the default value is
in double
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]