Marius Feteanu created SPARK-20212:
--------------------------------------
Summary: UDFs with Option[Primitive Type] don't work as expected
Key: SPARK-20212
URL: https://issues.apache.org/jira/browse/SPARK-20212
Project: Spark
Issue Type: Bug
Components: Optimizer
Affects Versions: 2.1.0
Reporter: Marius Feteanu
Priority: Minor
The documenation for ScalaUDF says:
{code:none}
Note that if you use primitive parameters, you are not able to check if it is
null or not, and the UDF will return null for you if the primitive input is
null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
{code}
This works with boxed types:
{code:none}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._
def is_null_box(x:java.lang.Long):String = {
x match {
case _:java.lang.Long => "Yep"
case null => "No man"
}
}
val is_null_box_udf = udf(is_null_box _)
val sample = (1L to 5L).toList.map(x=>new
java.lang.Long(x))++List[java.lang.Long](null, null)
val df = sample.toDF("col1")
df.select(is_null_box_udf(col("col1"))).show(10)
{code}
But does not work with Option\[Long\] as expected:
{code:none}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._
def is_null_opt(x:Option[Long]):String = {
x match {
case Some(_:Long) => "Yep"
case None => "No man"
}
}
val is_null_opt_udf = udf(is_null_opt _)
val sample = (1L to 5L)
// This does not help: val sample = (1L to 5L).map(Some(_)).toList
val df = sample.toDF("col1")
df.select(is_null_opt_udf(col("col1"))).show(10)
{code}
That throws:
{code:none}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to
scala.Option
at $anonfun$1.apply(<console>:56)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
{code}
The current workaround is to use boxed types but it makes for code that looks
funny.
If you just use Long instead of boxing the code may break in subtle ways (i.e.
it does not fail it returns null). That's documented but easy to miss (i.e. not
part of the bug but if someone "corrects" boxed functions to use primitive
types then they might get surprising results):
{code:none}
import org.apache.spark.sql.functions.{col, udf, expr}
import spark.implicits._
def is_null_opt(x:Long):String = {
Option(x) match {
case Some(_:Long) => "Yep"
case None => "No man"
}
}
val is_null_opt_udf = udf(is_null_opt _)
val sample = (1L to 5L)
val df = sample.toDF("col3").select(expr("CASE WHEN col3=2 THEN NULL ELSE col3
END").alias("col3"))
df.printSchema
df.select(is_null_opt_udf(col("col3"))).show(10)
{code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]