[jira] [Commented] (SPARK-11725) Let UDF to handle null value

Michael Armbrust (JIRA) Fri, 05 Feb 2016 15:37:40 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135264#comment-15135264
 ]


Michael Armbrust commented on SPARK-11725:
------------------------------------------

[[email protected]], unfortunately I don't think that is true.  
{{Long}} and other primitive types inherit from {{AnyVal}} while only things 
that inherit from {{AnyRef}} can be {{null}}.  If you attempt a comparison 
between a primitive and null, the compiler will tell you this.

{code}
scala> 1 == null
<console>:8: warning: comparing values of types Int and Null using `==' will always yield false
              1 == null
{code}
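
As a minimal sketch of the {{AnyVal}} / {{AnyRef}} distinction above (plain Scala, no Spark involved; the object name {{NullabilityDemo}} and the helper {{toOption}} are made up for illustration): a boxed {{java.lang.Long}} can hold {{null}}, a Scala {{Long}} cannot, and {{Option(...)}} is one way to cross from the first to the second safely.

```scala
object NullabilityDemo {
  // A boxed java.lang.Long is a reference type (AnyRef), so it can hold null.
  val boxed: java.lang.Long = null

  // A scala.Long is an AnyVal and cannot hold null.
  // The following line does not compile:
  //   val primitive: Long = null

  // Hypothetical helper: Option(...) turns a possibly-null reference into
  // Some/None. The value is unboxed only inside map, after the null check.
  def toOption(x: java.lang.Long): Option[Long] = Option(x).map(_.longValue)

  def main(args: Array[String]): Unit = {
    println(toOption(boxed))                      // None
    println(toOption(java.lang.Long.valueOf(5L))) // Some(5)
  }
}
```

This is why a function declared as {{(x: Int) => ...}} has no way to observe a null input: by the time the value has the type {{Int}}, the null is already gone.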

{quote}
Throwing an error when a null value cannot be passed to a UDF that has been 
compiled to only accept nulls.
{quote}

While this might be reasonable in a greenfield situation, I don't think we can 
change semantics on our users like that.  We chose these semantics because it's 
pretty common for databases to use null for error conditions, rather than 
failing the whole query.

{quote}
Using {{Option\[T\]}} as a UDF arg to signal that the function accepts nulls.
{quote}

I like this idea, and I actually expected it to work.  As you can see, it already 
works in Datasets:

{code}
Seq((1, new Integer(1)), (2, null)).toDF().as[(Int, Option[Int])].collect()
res0: Array[(Int, Option[Int])] = Array((1,Some(1)), (2,None))
{code}

We should definitely be using the same logic when converting arguments for UDFs.
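
To illustrate the idea only, here is a rough sketch in plain Scala of what "converting arguments" could look like. It is not Spark's actual conversion code; {{liftNullable}} and {{plusOne}} are hypothetical names introduced for this example. The wrapper receives the raw, possibly-null boxed value and hands the user function an {{Option}}, so the function itself decides what a null input means.

```scala
object OptionArgSketch {
  // Hypothetical helper (not a Spark API): adapts a function over Option[A]
  // so it can be called with a raw, possibly-null reference value.
  def liftNullable[A <: AnyRef, B](f: Option[A] => B): A => B =
    (raw: A) => f(Option(raw))

  // A user function that propagates a missing input as a missing output.
  val plusOne: Option[Integer] => Option[Int] = _.map(_.intValue + 1)

  def main(args: Array[String]): Unit = {
    val f = liftNullable(plusOne)
    println(f(Integer.valueOf(30))) // Some(31)
    println(f(null))                // None
  }
}
```

With this shape, a null {{age}} would come out as {{None}} rather than being replaced by a placeholder value before the function runs.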

> Let UDF to handle null value
> ----------------------------
>
>                 Key: SPARK-11725
>                 URL: https://issues.apache.org/jira/browse/SPARK-11725
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Jeff Zhang
>            Assignee: Wenchen Fan
>            Priority: Blocker
>              Labels: releasenotes
>             Fix For: 1.6.0
>
>
> I notice that currently spark will take the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
> //////////////// Output ///////////////////////
> +----+-------+----+
> | age|   name|age2|
> +----+-------+----+
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> +----+-------+----+
> {code}
> I think for the null value we have 3 options
> * Use a special value to represent it (what spark does now)
> * Always return null if the udf input has null value argument 
> * Let udf itself to handle null
> I would prefer the third option 



