[
https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135264#comment-15135264
]
Michael Armbrust commented on SPARK-11725:
------------------------------------------
[[email protected]], unfortunately I don't think that is true.
{{Long}} and other primitive types inherit from {{AnyVal}} while only things
that inherit from {{AnyRef}} can be {{null}}. If you attempt a comparison
between a primitive and null, the compiler will tell you this.
{code}
scala> 1 == null
<console>:8: warning: comparing values of types Int and Null using `==' will
always yield false
1 == null
{code}
{quote}
Throwing an error when a null value cannot be passed to a UDF that has been
compiled to only accept nulls.
{quote}
While this might be reasonable in a greenfield situation, I don't think we can
change semantics on our users like that. We chose these semantics because it's
pretty common for databases to use null for error conditions, rather than
failing the whole query.
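As a rough illustration of the null-in, null-out behaviour described above, here is a minimal plain-Scala sketch. It does not use Spark and is not Spark's actual code; {{NullPropagation}} and {{nullSafe}} are hypothetical names. It lifts a function over a primitive {{Int}} so that a null input yields a null result instead of an exception.

```scala
object NullPropagation {
  // Hypothetical wrapper (not a Spark API): lifts a function over a primitive
  // Int so that a null boxed input produces a null output instead of failing.
  def nullSafe(f: Int => Int): java.lang.Integer => java.lang.Integer =
    boxed =>
      if (boxed == null) null
      else java.lang.Integer.valueOf(f(boxed.intValue))

  def main(args: Array[String]): Unit = {
    val plusOne = nullSafe(_ + 1)
    println(plusOne(java.lang.Integer.valueOf(30))) // 31
    println(plusOne(null))                          // null
  }
}
```

The caller only ever sees a boxed result, so a missing input stays visibly missing rather than turning into a sentinel value.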
{quote}
Using {{Option\[T\]}} as a UDF arg to signal that the function accepts nulls.
{quote}
I like this idea and I actually expected it to work. As you can see it already
works in datasets:
{code}
Seq((1, new Integer(1)), (2, null)).toDF().as[(Int, Option[Int])].collect()
res0: Array[(Int, Option[Int])] = Array((1,Some(1)), (2,None))
{code}
We should definitely be using the same logic when converting arguments for UDFs.
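To sketch what "the same logic" could look like for a UDF argument, here is a small plain-Scala example. It is only an assumption about the shape of the conversion, not the actual Spark implementation; {{OptionArgSketch}}, {{adapt}} and {{describeAge}} are hypothetical names. The adapter turns a nullable boxed argument into an {{Option}} before calling a user function that declares {{Option\[Int\]}}.

```scala
object OptionArgSketch {
  // Hypothetical user function that opts in to seeing missing values
  // by declaring its argument as Option[Int].
  val describeAge: Option[Int] => String = {
    case Some(age) => s"age ${age + 1} next year"
    case None      => "age unknown"
  }

  // Hypothetical adapter (not a Spark API): converts a nullable boxed
  // argument into an Option, so null becomes None and a value becomes Some.
  def adapt[R](f: Option[Int] => R): java.lang.Integer => R =
    boxed => f(Option(boxed).map(_.intValue))

  def main(args: Array[String]): Unit = {
    val udf = adapt(describeAge)
    println(udf(java.lang.Integer.valueOf(30))) // age 31 next year
    println(udf(null))                          // age unknown
  }
}
```

With this shape the function author decides what a missing value means, which is the third option listed in the issue description.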
> Let UDF to handle null value
> ----------------------------
>
> Key: SPARK-11725
> URL: https://issues.apache.org/jira/browse/SPARK-11725
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Jeff Zhang
> Assignee: Wenchen Fan
> Priority: Blocker
> Labels: releasenotes
> Fix For: 1.6.0
>
>
> I notice that Spark currently treats the long field as -1 if it is null.
> Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
> //////////////// Output ///////////////////////
> +----+-------+----+
> | age| name|age2|
> +----+-------+----+
> |null|Michael| 0|
> | 30| Andy| 31|
> | 19| Justin| 20|
> +----+-------+----+
> {code}
> I think we have 3 options for the null value:
> * Use a special value to represent it (what Spark does now)
> * Always return null if any UDF input argument is null
> * Let the UDF itself handle null
> I would prefer the third option.