Hi all,

I'd like to raise a discussion here about null handling for primitive-type parameters of the untyped Scala UDF [ udf(f: AnyRef, dataType: DataType) ].

After we switched to Scala 2.12 in 3.0, the untyped Scala UDF is broken, because we can no longer use reflection to get the parameter types of the Scala lambda. This leads to silent result changes. For example, with a UDF defined as `val f = udf((x: Int) => x, IntegerType)`, the query `select f($"x")` behaves differently between 2.4 and 3.0 when the input value of column x is null:

Spark 2.4: null
Spark 3.0: 0
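Here is a minimal sketch of the repro; it assumes a Spark shell (so the `spark` session and `spark.implicits._` are available), and the column values are made up for illustration:

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// Untyped UDF: on Scala 2.12, Spark can no longer reflect on the lambda,
// so it cannot tell that the parameter is a primitive Int that never holds null.
val f = udf((x: Int) => x, IntegerType)

// A nullable int column: Some(1) becomes 1, None becomes null.
val df = Seq(Some(1), None).toDF("x")

df.select(f($"x")).show()
// Spark 2.4: the null row stays null (a null check is generated from the
//            reflected parameter type)
// Spark 3.0: the null row becomes 0 (null is decoded to Int's default value)
```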
Because of this, we deprecated the untyped Scala UDF in 3.0 and recommend that users use the typed ones instead. However, I recently identified several valid use cases, e.g. `val f = (r: Row) => Row(r.getAs[Int](0) * 2)`, where the schema cannot be derived by the typed Scala UDF [ udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]) ]; see the first sketch at the end of this mail.

There are 3 possible solutions:

1. Find a way to get the Scala lambda parameter types by reflection. (I tried very hard but had no luck; the Java SAM type is too dynamic.)
2. Support case classes as input to the typed Scala UDF, so that people can at least still handle struct-type input columns with a UDF.
3. Add a new variant of the untyped Scala UDF where users can specify the input types; the second sketch at the end of this mail shows the idea.

I'd like to hear more feedback or ideas about how to move forward.

Thanks,
Yi Wu
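P.S. A sketch of the struct-input use case, again assuming a Spark shell; the schema and column names are made up for illustration:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
import org.apache.spark.sql.types._
import spark.implicits._

// The lambda works on a generic Row, because the struct's schema is only
// known at runtime; Spark passes struct-type columns into Scala UDFs as Rows.
val f = (r: Row) => Row(r.getAs[Int](0) * 2)

// Untyped variant: works, because the result schema is given explicitly.
val doubleFirst = udf(f, StructType(Seq(StructField("i", IntegerType))))

val df = Seq((1, "a"), (2, "b")).toDF("i", "s")
df.select(doubleFirst(struct($"i", $"s"))).show()

// Typed variant: fails, because no schema can be derived for Row:
// udf(f)  // throws "Schema for type org.apache.spark.sql.Row is not supported"
```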
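P.P.S. A purely hypothetical shape for option 3; neither this overload nor its parameter names exist today, this is just to make the idea concrete:

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types.{DataType, IntegerType}

// Hypothetical: the caller tells Spark the input types, so Spark can add
// the null handling (e.g. return null rather than feed null into a
// primitive Int parameter) that reflection used to derive on Scala 2.11.
def udf(f: AnyRef, dataType: DataType, inputTypes: Seq[DataType]): UserDefinedFunction = ???

// Intended usage:
// val f = udf((x: Int) => x, IntegerType, Seq(IntegerType))
// select f($"x")  -- would return null for a null x on both 2.4 and 3.0
```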