[ 
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135447#comment-15135447
 ] 

Cristian Opris commented on SPARK-11725:
----------------------------------------

You're right, sorry. It seems I lost track of which Scala weirdness applies 
here: 

{code}
scala> def i : Int = (null : java.lang.Integer)
i: Int

scala> i == null
<console>:22: warning: comparing values of types Int and Null using `==' will always yield false
              i == null
                ^
java.lang.NullPointerException
        at scala.Predef$.Integer2int(Predef.scala:392)
{code}

At least the previous behaviour of passing 0 as default value was consistent 
with this specific Scala weirdness:

{code}
scala> val i: Int  = null.asInstanceOf[Int]
i: Int = 0
{code}
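
For comparison, here is a minimal sketch in plain Scala (no Spark involved) of the three ways a null java.lang.Integer can end up as an Int. The object and value names are only for illustration:

```scala
object UnboxingSketch {
  // A null boxed value, similar to what a nullable column could hand over.
  val boxed: java.lang.Integer = null

  // 1. asInstanceOf silently turns null into the primitive default (0).
  val viaCast: Int = null.asInstanceOf[Int]

  // 2. The implicit Integer2int conversion unboxes eagerly and throws a
  //    NullPointerException, which Try captures as a Failure.
  val viaImplicit: scala.util.Try[Int] = scala.util.Try {
    val i: Int = boxed
    i
  }

  // 3. Option(...) maps null to None, keeping the "no value" case explicit.
  val viaOption: Option[Int] = Option(boxed).map(_.intValue)
}
```

Only the third form lets the caller tell "no value" apart from a real 0 without risking an exception.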

But the current behaviour of "null propagation" seems even more confusing 
semantically. For example, a UDF can take multiple args, and skipping the call 
entirely whenever any one of the args is null can hardly make sense for every 
such UDF.

{code}
def udf(i : Int, l: Long, s: String) = ...

sql("select udf(i, l, s) from df") 
{code}

I understand the need not to break existing code, but a clearer, documented 
UDF spec, perhaps with the use of Option, would really help here.
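
To make the suggestion a bit more concrete, here is a rough sketch in plain Scala (this is not the Spark API) of what an Option-based UDF could look like next to "null propagation". The names describe and nullPropagating, and the defaults chosen for missing values, are hypothetical:

```scala
object OptionUdfSketch {
  // Hypothetical UDF body: each argument may be absent, and the function
  // itself decides what a missing value means.
  def describe(i: Option[Int], l: Option[Long], s: Option[String]): String = {
    val total: Long = i.getOrElse(0) + l.getOrElse(0L)
    val label: String = s.getOrElse("<none>")
    label + ": " + total
  }

  // "Null propagation" modelled with Option: describe is not called at all
  // unless every argument is present, otherwise the result is None.
  def nullPropagating(i: Option[Int], l: Option[Long], s: Option[String]): Option[String] =
    for {
      a <- i
      b <- l
      c <- s
    } yield describe(Some(a), Some(b), Some(c))
}
```

With a missing Long, describe(Some(1), None, Some("a")) still returns "a: 1", while nullPropagating(Some(1), None, Some("a")) returns None without ever calling the function.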

> Let UDF to handle null value
> ----------------------------
>
>                 Key: SPARK-11725
>                 URL: https://issues.apache.org/jira/browse/SPARK-11725
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Jeff Zhang
>            Assignee: Wenchen Fan
>            Priority: Blocker
>              Labels: releasenotes
>             Fix For: 1.6.0
>
>
> I notice that currently spark will take the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
> //////////////// Output ///////////////////////
> +----+-------+----+
> | age|   name|age2|
> +----+-------+----+
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> +----+-------+----+
> {code}
> I think for the null value we have 3 options
> * Use a special value to represent it (what spark does now)
> * Always return null if the udf input has null value argument 
> * Let udf itself to handle null
> I would prefer the third option 


