[
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135447#comment-15135447
]
Cristian Opris commented on SPARK-11725:
----------------------------------------
You're right; sorry, it seems I lost track of which Scala weirdness applies
here:
{code}
scala> def i : Int = (null : java.lang.Integer)
i: Int
scala> i == null
<console>:22: warning: comparing values of types Int and Null using `==' will
always yield false
i == null
^
java.lang.NullPointerException
at scala.Predef$.Integer2int(Predef.scala:392)
{code}
At least the previous behaviour of passing 0 as the default value was consistent
with this specific Scala weirdness:
{code}
scala> val i: Int = null.asInstanceOf[Int]
i: Int = 0
{code}
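The two behaviours above can be seen side by side in plain Scala, with no Spark involved (a minimal sketch; the variable names are illustrative only):

```scala
// Unboxing a null java.lang.Integer throws (via Predef.Integer2int),
// while casting null to Int silently yields the primitive default, 0.
val boxed: java.lang.Integer = null
val viaCast: Int = null.asInstanceOf[Int]  // 0, the primitive default

val unboxFails: Boolean =
  try { val i: Int = boxed; false }        // implicit unboxing of null
  catch { case _: NullPointerException => true }

println(viaCast)     // 0
println(unboxFails)  // true
```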
But the current behaviour of "null propagation" seems even more confusing
semantically: for a UDF with multiple arguments, silently skipping the UDF call
whenever any one of the arguments is null hardly makes semantic sense.
{code}
def udf(i : Int, l: Long, s: String) = ...
sql("select udf(i, l, s) from df")
{code}
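The null-propagation semantics I mean can be modelled in plain Scala (a sketch of the behaviour only, not Spark's actual implementation; `nullSafeCall` is a hypothetical name used for illustration):

```scala
// Sketch of null-propagation semantics for a primitive-typed UDF:
// if any argument is null, the UDF body is never invoked and the
// result is null. `nullSafeCall` is a hypothetical helper name.
def nullSafeCall(i: java.lang.Integer, l: java.lang.Long, s: String)
                (udf: (Int, Long, String) => String): String =
  if (i == null || l == null || s == null) null
  else udf(i, l, s)

val f = (i: Int, l: Long, s: String) => s"$s:${i + l}"

val called  = nullSafeCall(1, 2L, "x")(f)     // UDF runs: "x:3"
val skipped = nullSafeCall(null, 2L, "x")(f)  // UDF is silently skipped: null
```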
I understand the need not to break existing code, but a clearer, documented UDF
spec, perhaps using Option, would really help here.
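For example, an Option-based signature would make the null case explicit and type-checked inside the UDF itself (a sketch under the assumption that null columns would be mapped to None; `udfOpt` and the return values are illustrative):

```scala
// Sketch: with an Option-based UDF spec, the author decides what null
// means for each argument instead of the call being silently skipped.
val udfOpt: (Option[Int], Option[String]) => String = {
  case (Some(i), Some(s)) => s"$s:${i + 1}"
  case (None,    Some(s)) => s"$s:missing"  // explicit null handling
  case (_,       None)    => "no-name"
}

val withAge    = udfOpt(Some(30), Some("Andy"))
val withoutAge = udfOpt(None, Some("Michael"))
```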
> Let UDF to handle null value
> ----------------------------
>
> Key: SPARK-11725
> URL: https://issues.apache.org/jira/browse/SPARK-11725
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Jeff Zhang
> Assignee: Wenchen Fan
> Priority: Blocker
> Labels: releasenotes
> Fix For: 1.6.0
>
>
> I notice that currently Spark treats the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
> //////////////// Output ///////////////////////
> +----+-------+----+
> | age| name|age2|
> +----+-------+----+
> |null|Michael| 0|
> | 30| Andy| 31|
> | 19| Justin| 20|
> +----+-------+----+
> {code}
> I think for the null value we have 3 options:
> * Use a special value to represent it (what Spark does now)
> * Always return null if any UDF argument is null
> * Let the UDF itself handle null
> I would prefer the third option.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)