[ 
https://issues.apache.org/jira/browse/SPARK-20212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20212.
----------------------------------
    Resolution: Duplicate

It sounds a duplicate of SPARK-12648 and I think the fix should be in the same 
PR if needed. I am resolving this. Please reopen this if I misunderstood.

> UDFs with Option[Primitive Type] don't work as expected
> -------------------------------------------------------
>
>                 Key: SPARK-20212
>                 URL: https://issues.apache.org/jira/browse/SPARK-20212
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Marius Feteanu
>            Priority: Minor
>
> The documenation for ScalaUDF says:
> {code:none}
> Note that if you use primitive parameters, you are not able to check if it is
> null or not, and the UDF will return null for you if the primitive input is
> null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
> {code}
> This works with boxed types:
> {code:none}
> import org.apache.spark.sql.functions.{col, udf}
> import spark.implicits._
> def is_null_box(x:java.lang.Long):String = {
>     x match {
>         case _:java.lang.Long => "Yep"
>         case null => "No man"
>     }
> }
> val is_null_box_udf = udf(is_null_box _)
> val sample = (1L to 5L).toList.map(x=>new 
> java.lang.Long(x))++List[java.lang.Long](null, null)
> val df = sample.toDF("col1")
> df.select(is_null_box_udf(col("col1"))).show(10)
> {code}
> But does not work with Option\[Long\] as expected:
> {code:none}
> import org.apache.spark.sql.functions.{col, udf}
> import spark.implicits._
> def is_null_opt(x:Option[Long]):String = {
>     x match {
>         case Some(_:Long) => "Yep"
>         case None => "No man"
>     }
> }
> val is_null_opt_udf = udf(is_null_opt _)
> val sample = (1L to 5L)
> // This does not help: val sample = (1L to 5L).map(Some(_)).toList
> val df = sample.toDF("col1")
> df.select(is_null_opt_udf(col("col1"))).show(10)
> {code}
> That throws:
> {code:none}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to 
> scala.Option
>   at $anonfun$1.apply(<console>:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
> {code}
> The current workaround is to use boxed types but it makes for code that looks 
> funny.
> If you just use Long instead of boxing the code may break in subtle ways 
> (i.e.  it does not fail it returns null). That's documented but easy to miss 
> (i.e. not part of the bug but if someone "corrects" boxed functions to use 
> primitive types then they might get surprising results):
> {code:none}
> import org.apache.spark.sql.functions.{col, udf, expr}
> import spark.implicits._
> def is_null_opt(x:Long):String = {
>     Option(x) match {
>         case Some(_:Long) => "Yep"
>         case None => "No man"
>     }
> }
> val is_null_opt_udf = udf(is_null_opt _)
> val sample = (1L to 5L)
> val df = sample.toDF("col3").select(expr("CASE WHEN col3=2 THEN NULL ELSE 
> col3 END").alias("col3"))
> df.printSchema
> df.select(is_null_opt_udf(col("col3"))).show(10)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to