Hi all, I'd like to raise a discussion here about null handling for
primitive-type arguments of the untyped Scala UDF [ udf(f: AnyRef, dataType: DataType) ].

After the switch to Scala 2.12 in 3.0, the untyped Scala UDF is broken
because we can no longer use reflection to get the parameter types of the
Scala lambda. This leads to silent result changes. For example, with a UDF
defined as `val f = udf((x: Int) => x, IntegerType)`, the query
`select f($"x")` behaves differently between 2.4 and 3.0 when the input
value of column x is null:

Spark 2.4:  null
Spark 3.0:  0
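
Here is a minimal sketch reproducing the change (assuming a local
SparkSession; the boxed java.lang.Integer is just a way to get a nullable
Int column):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Boxed Integer gives us a nullable column "x".
val df = Seq[Integer](1, null).toDF("x")

// Untyped UDF: Spark can no longer see that the lambda takes a primitive
// Int, so in 3.0 the null input is unboxed to the default value 0 instead
// of being short-circuited to a null result.
val f = udf((x: Int) => x, IntegerType)
df.select(f($"x")).show()
// 2.4 prints null for the second row; 3.0 prints 0.
```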

Because of this, we deprecated the untyped Scala UDF in 3.0 and recommend
that users switch to the typed ones. However, I recently identified several
valid use cases, e.g., `val f = (r: Row) => Row(r.getAs[Int](0) * 2)`, where
the schema cannot be derived by the typed Scala UDF [ udf[RT: TypeTag, A1:
TypeTag](f: Function1[A1, RT]) ].

There are 3 possible solutions:
1. Find a way to get the Scala lambda parameter types via reflection (I
tried very hard but had no luck; the Java SAM type is too dynamic).
2. Support case classes as input to the typed Scala UDF, so people can at
least handle struct-type input columns with a UDF.
3. Add a new variant of the untyped Scala UDF that lets users specify the
input types (see the sketch after this list).
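
To make option 3 concrete, a hypothetical signature (not an existing API)
could look like the following; with the input types declared, Spark could
re-introduce the 2.4-style null check for primitive parameters:

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types.{DataType, IntegerType}

// Hypothetical new variant for option 3 (illustrative only, not in Spark):
// the caller declares the input types, so Spark can null-check primitive
// parameters instead of unboxing null to a default value.
def udf(f: AnyRef, dataType: DataType, inputTypes: Seq[DataType]): UserDefinedFunction = ???

// Usage would look like:
// val f = udf((x: Int) => x, IntegerType, Seq(IntegerType))
```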

I'd like to hear more feedback or ideas about how to move forward.

Thanks,
Yi Wu


