[ 
https://issues.apache.org/jira/browse/SPARK-18489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673528#comment-15673528
 ] 

Herman van Hovell edited comment on SPARK-18489 at 11/17/16 12:03 PM:
----------------------------------------------------------------------

[~hvanhovell] 
Will this change cover all operators, including UDFs? For example:
{noformat}
val nullResolveFun: Int => Int = { s =>
  if (s == null) 0
  else s * 2
}

val nullResolve = udf(nullResolveFun)

df.withColumn("_c2_cleaned", nullResolve('_c2)).show
+---+---+----+-----------+
|_c0|_c1| _c2|_c2_cleaned|
+---+---+----+-----------+
|  1|1.0|   1|          2|
|  2|1.0|   s|       null|
|  3|3.1|null|       null|
+---+---+----+-----------+
{noformat}
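
On the UDF example above: a Scala `Int` parameter is a primitive and can never be `null`, so the `s == null` branch is unreachable. As far as I know, Spark returns null itself when a primitive UDF argument is null, which would match the `null` results shown. Below is a minimal sketch of a null-aware variant in plain Scala, with no Spark dependency. `castToInt` is a hypothetical stand-in for the implicit string-to-int cast, and `NullSafeUdfSketch` and the `Option[Int]` version of `nullResolve` are illustrative names, not Spark API.

```scala
import scala.util.Try

object NullSafeUdfSketch {
  // Hypothetical model of the implicit string-to-int cast:
  // a null or unparseable string becomes None (a SQL null).
  def castToInt(s: String): Option[Int] =
    Option(s).flatMap(v => Try(v.toInt).toOption)

  // Null-aware version of nullResolveFun: a missing input maps to 0,
  // any present value is doubled.
  def nullResolve(value: Option[Int]): Int =
    value.map(_ * 2).getOrElse(0)

  def main(args: Array[String]): Unit = {
    // Same _c2 values as in the example table.
    val c2 = Seq("1", "s", null)
    c2.foreach { raw =>
      println(s"$raw -> ${nullResolve(castToInt(raw))}")
    }
  }
}
```

With the three `_c2` values from the table, this sketch maps `1` to 2 and both `s` and `null` to 0, which is presumably what the original function intended.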


(Comment author: dasbipulkumar)

> Implicit type conversion during comparison between Integer type column and 
> String type column
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18489
>                 URL: https://issues.apache.org/jira/browse/SPARK-18489
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Bipul Kumar
>
> Suppose I have a dataframe with schema:
> {noformat}
> root
>  |-- _c0: integer (nullable = true)
>  |-- _c1: double (nullable = true)
>  |-- _c2: string (nullable = true)
> {noformat}
> and data:
> {noformat}
> +---+---+----+
> |_c0|_c1| _c2|
> +---+---+----+
> |  1|1.0|   1|
> |  2|1.0|   s|
> |  3|3.1|null|
> +---+---+----+
> {noformat}
> If the following operations are carried out:
> {noformat}
> df.where("_c1==_c2").show
> +---+---+---+
> |_c0|_c1|_c2|
> +---+---+---+
> |  1|1.0|  1|
> +---+---+---+
> {noformat}
> {noformat}
> df.where("_c1<>_c2").show   or   df.where("_c1!=_c2").show 
> +---+---+---+
> |_c0|_c1|_c2|
> +---+---+---+
> +---+---+---+
> {noformat}
> So the results of these related operations are inconsistent: neither the
> equality filter nor the inequality filter returns rows 2 and 3.
> Here the numeric strings are implicitly cast, whereas the other values are
> silently ignored instead of causing an exception.
> In my view this can lead to incorrect results if the dataset is not
> inspected carefully.
> Also, the SQL-99 standard discourages implicit casting to avoid such problems:
> https://users.dcc.uchile.cl/~cgutierr/cursos/BD/standards.pdf
> The same implicit casting also applies to UDFs and aggregation functions.
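
The behaviour described in the issue can be reproduced with a small model in plain Scala (no Spark dependency). The sketch assumes that the string side of the comparison is cast to a double, that unparseable or null strings become null, and that a filter keeps only rows whose predicate is definitely true. That is consistent with the output reported above, but it is a model, not Spark's actual implementation. The names `ImplicitCastSketch`, `castToDouble`, `sqlEq`, `sqlNeq` and `where` are hypothetical.

```scala
import scala.util.Try

object ImplicitCastSketch {
  final case class Row(c0: Int, c1: Double, c2: String)

  // Assumed cast rule: null or non-numeric strings become None (SQL null).
  def castToDouble(s: String): Option[Double] =
    Option(s).flatMap(v => Try(v.toDouble).toOption)

  // Three-valued comparison: None means "unknown" because one side is null.
  def sqlEq(a: Double, b: Option[Double]): Option[Boolean] = b.map(a == _)
  def sqlNeq(a: Double, b: Option[Double]): Option[Boolean] = b.map(a != _)

  // A filter keeps a row only when the predicate is definitely true.
  def where(rows: Seq[Row])(pred: Row => Option[Boolean]): Seq[Row] =
    rows.filter(r => pred(r).contains(true))

  // Same data as in the issue description.
  val rows: Seq[Row] = Seq(
    Row(1, 1.0, "1"),
    Row(2, 1.0, "s"),
    Row(3, 3.1, null)
  )

  def main(args: Array[String]): Unit = {
    val equal    = where(rows)(r => sqlEq(r.c1, castToDouble(r.c2)))
    val notEqual = where(rows)(r => sqlNeq(r.c1, castToDouble(r.c2)))
    println(s"_c1 == _c2 keeps ${equal.size} row(s)")
    println(s"_c1 <> _c2 keeps ${notEqual.size} row(s)")
  }
}
```

Under these assumptions the equality filter keeps only row 1 and the inequality filter keeps nothing, because rows 2 and 3 evaluate to "unknown" for both predicates. A stricter design would raise an error in `castToDouble` for non-numeric input instead of returning `None`.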


