[
https://issues.apache.org/jira/browse/SPARK-20212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-20212.
----------------------------------
Resolution: Duplicate
It sounds a duplicate of SPARK-12648 and I think the fix should be in the same
PR if needed. I am resolving this. Please reopen this if I misunderstood.
> UDFs with Option[Primitive Type] don't work as expected
> -------------------------------------------------------
>
> Key: SPARK-20212
> URL: https://issues.apache.org/jira/browse/SPARK-20212
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Marius Feteanu
> Priority: Minor
>
> The documenation for ScalaUDF says:
> {code:none}
> Note that if you use primitive parameters, you are not able to check if it is
> null or not, and the UDF will return null for you if the primitive input is
> null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
> {code}
> This works with boxed types:
> {code:none}
> import org.apache.spark.sql.functions.{col, udf}
> import spark.implicits._
> def is_null_box(x:java.lang.Long):String = {
> x match {
> case _:java.lang.Long => "Yep"
> case null => "No man"
> }
> }
> val is_null_box_udf = udf(is_null_box _)
> val sample = (1L to 5L).toList.map(x=>new
> java.lang.Long(x))++List[java.lang.Long](null, null)
> val df = sample.toDF("col1")
> df.select(is_null_box_udf(col("col1"))).show(10)
> {code}
> But does not work with Option\[Long\] as expected:
> {code:none}
> import org.apache.spark.sql.functions.{col, udf}
> import spark.implicits._
> def is_null_opt(x:Option[Long]):String = {
> x match {
> case Some(_:Long) => "Yep"
> case None => "No man"
> }
> }
> val is_null_opt_udf = udf(is_null_opt _)
> val sample = (1L to 5L)
> // This does not help: val sample = (1L to 5L).map(Some(_)).toList
> val df = sample.toDF("col1")
> df.select(is_null_opt_udf(col("col1"))).show(10)
> {code}
> That throws:
> {code:none}
> Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to
> scala.Option
> at $anonfun$1.apply(<console>:56)
> at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
> at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
> at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
> {code}
> The current workaround is to use boxed types but it makes for code that looks
> funny.
> If you just use Long instead of boxing the code may break in subtle ways
> (i.e. it does not fail it returns null). That's documented but easy to miss
> (i.e. not part of the bug but if someone "corrects" boxed functions to use
> primitive types then they might get surprising results):
> {code:none}
> import org.apache.spark.sql.functions.{col, udf, expr}
> import spark.implicits._
> def is_null_opt(x:Long):String = {
> Option(x) match {
> case Some(_:Long) => "Yep"
> case None => "No man"
> }
> }
> val is_null_opt_udf = udf(is_null_opt _)
> val sample = (1L to 5L)
> val df = sample.toDF("col3").select(expr("CASE WHEN col3=2 THEN NULL ELSE
> col3 END").alias("col3"))
> df.printSchema
> df.select(is_null_opt_udf(col("col3"))).show(10)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]