koertkuipers commented on a change in pull request #27937:
URL: https://github.com/apache/spark/pull/27937#discussion_r422699979
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
##########
@@ -93,7 +93,7 @@ sealed abstract class UserDefinedFunction {
private[spark] case class SparkUserDefinedFunction(
f: AnyRef,
dataType: DataType,
- inputSchemas: Seq[Option[ScalaReflection.Schema]],
Review comment:
In my testing I also found the API to be somewhat inconsistent and confusing. Basically, sometimes
`CatalystTypeConverters.createToScalaConverter` is used and sometimes
`ExpressionEncoder.fromRow`, depending solely on whether the argument is a top-level
struct or not. But `CatalystTypeConverters.createToScalaConverter` and
`ExpressionEncoder.fromRow` behave very differently, leading to inconsistent
behavior.
For example, this (contrived) usage works:
```
case class Person(name: String, age: Option[Int])
Seq((1, Person("john", Some(55))), (2, Person("mary", None))).toDF("id", "person")
  .withColumn("age", udf { p: Person => p.age }.apply(col("person")))
```
But this does not:
```
Seq((1, Seq(Person("john", Some(55)))), (2, Seq(Person("mary", None)))).toDF("id", "persons")
  .withColumn("ages", udf { s: Seq[Person] => s.map(_.age) }.apply(col("persons")))
```
And while `Option` works nicely in the `Person` case class (and also in tuples),
`Option` does not work inside a simple `Seq`:
```
Seq(Seq(Some(1), None)).toDF.withColumn("value",
  udf { s: Seq[Option[Int]] => s.map(_.map(_ + 1)) }.apply(col("value")))
```
This inconsistency will be hard for users to understand.
Finally, let me give some background on why I am a little nervous about this
change. Spark UDFs have been somewhat limited for a long time: no support for case
classes, tuples, or options. Many libraries have worked around that by defining
their own UDFs on top of `SparkUserDefinedFunction`; we do this in-house too. It
is easy to do this with type classes thanks to the composability of
`inputSchemas`.
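The composability argument can be sketched with a minimal, Spark-free type class. Everything here (`SchemaFor`, `SchemaLike`, and the instances) is a hypothetical stand-in for the kind of library built on top of `inputSchemas`, not a Spark API; the point is only that an instance for `Seq[T]` or a tuple is derived mechanically from the instances of its components, which is exactly what a sequence of per-argument schemas allows and an opaque `ExpressionEncoder` does not:

```scala
// Hypothetical sketch of the type-class pattern described above.
// None of these names are Spark APIs.
sealed trait SchemaLike
case object IntSchema extends SchemaLike
case object StringSchema extends SchemaLike
final case class ArraySchema(element: SchemaLike) extends SchemaLike
final case class StructSchema(fields: Seq[(String, SchemaLike)]) extends SchemaLike

trait SchemaFor[T] { def schema: SchemaLike }

object SchemaFor {
  def apply[T](implicit s: SchemaFor[T]): SchemaFor[T] = s

  implicit val intSchema: SchemaFor[Int] =
    new SchemaFor[Int] { def schema = IntSchema }
  implicit val stringSchema: SchemaFor[String] =
    new SchemaFor[String] { def schema = StringSchema }

  // Instances compose: the Seq[T] instance reuses the instance for T.
  implicit def seqSchema[T](implicit t: SchemaFor[T]): SchemaFor[Seq[T]] =
    new SchemaFor[Seq[T]] { def schema = ArraySchema(t.schema) }

  // A tuple instance is built from its components' instances.
  implicit def tuple2Schema[A, B](
      implicit a: SchemaFor[A], b: SchemaFor[B]): SchemaFor[(A, B)] =
    new SchemaFor[(A, B)] {
      def schema = StructSchema(Seq("_1" -> a.schema, "_2" -> b.schema))
    }
}

// Derived automatically from the pieces, no reflection needed:
println(SchemaFor[Seq[(Int, String)]].schema)
// ArraySchema(StructSchema(List((_1,IntSchema), (_2,StringSchema))))
```

A `TypeTag`-based `ExpressionEncoder` offers no analogous way to assemble the encoder for a compound type from user-supplied encoders of its parts.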
So now `inputSchemas` has been replaced with `inputEncoders`, but `ExpressionEncoder`
and `TypeTag`s are not composable, and I do not see a way for us to build on top of
this for our own in-house UDFs. The alternative for us is to abandon our
in-house UDFs and start using Spark's UDFs again, which now support case classes
and tuples, which is nice! But the inconsistency of the API and the lack of support
for `Option` make that currently not viable for me. I realize this is a Spark-internal
API and this is entirely my own problem, but I thought it was worth
pointing out because I suspect I am not the only one who has done this. I
think this is one of the more typical workarounds people have applied to Spark
(and I am aware of multiple implementations of this workaround).
Sorry for the long post(s).