[GitHub] spark pull request #17577: [SPARK-20270][SQL] na.fill should not change the ...

dbtsai Sun, 09 Apr 2017 01:06:07 -0700

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/17577


    [SPARK-20270][SQL] na.fill should not change the values in long or integer 
when the default value is in double

    ## What changes were proposed in this pull request?
    
    This bug was partially addressed in SPARK-18555 
https://github.com/apache/spark/pull/15994, but the root cause isn't completely 
solved. This bug is pretty critical since it changes the member id in Long in 
our application if the member id can not be represented by Double losslessly 
when the member id is very big.
    
    Here is an example how this happens, with
    ```
          Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), 
(9123146099426677101L, null),
            (9123146560113991650L, 1.6), (null, null)).toDF("a", 
"b").na.fill(0.2),
    ```
    the logical plan will be
    ```
    == Analyzed Logical Plan ==
    a: bigint, b: double
    Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as 
bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as 
double) AS b#241]
    +- Project [_1#229L AS a#232L, _2#230 AS b#233]
       +- LocalRelation [_1#229L, _2#230]
    ```
    
    Note that even the value is not null, Spark will cast the Long into Double 
first. Then if it's not null, Spark will cast it back to Long which results in 
losing precision.
    
    The behavior should be that the original value should not be changed if 
it's not null, but Spark will change the value which is wrong.
    
    With the PR, the logical plan will be
    ```
    == Analyzed Logical Plan ==
    a: bigint, b: double
    Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, 
coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
    +- Project [_1#229L AS a#232L, _2#230 AS b#233]
       +- LocalRelation [_1#229L, _2#230]
    ```
    which behaves correctly without changing the original Long values and also 
avoids extra cost of unnecessary casting.
    
    ## How was this patch tested?
    
    unit test added.
    
    +cc @srowen @rxin @cloud-fan @gatorsmile 
    
    Thanks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark fixnafill

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17577.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17577
    
----
commit cb5440787d676d2c74983ddcd1df31b38d009d71
Author: DB Tsai <d...@netflix.com>
Date:   2017-04-09T07:57:45Z

    na.fill will change the values in long or integer when the default value is 
in double

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17577: [SPARK-20270][SQL] na.fill should not change the ...

Reply via email to