koertkuipers commented on a change in pull request #27937:
URL: https://github.com/apache/spark/pull/27937#discussion_r422699979
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
##########
@@ -93,7 +93,7 @@ sealed abstract class UserDefinedFunction {
private[spark] case class SparkUserDefinedFunction(
f: AnyRef,
dataType: DataType,
- inputSchemas: Seq[Option[ScalaReflection.Schema]],
Review comment:
In my testing I also found the API to be somewhat inconsistent and confusing. Basically, sometimes
`CatalystTypeConverters.createToScalaConverter` is used and sometimes
`ExpressionEncoder.fromRow`, depending solely on whether the argument is a top-level
struct or not. But `CatalystTypeConverters.createToScalaConverter` and
`ExpressionEncoder.fromRow` behave very differently, leading to inconsistent
behavior.
For example, this (contrived) usage works:
```
case class Person(name: String, age: Option[Int])
Seq((1, Person("john", Some(55))), (2, Person("mary", None))).toDF("id", "person")
  .withColumn("age", udf { p: Person => p.age }.apply(col("person")))
```
But this does not:
```
Seq((1, Seq(Person("john", Some(55)))), (2, Seq(Person("mary", None)))).toDF("id", "persons")
  .withColumn("ages", udf { s: Seq[Person] => s.map(_.age) }.apply(col("persons")))
```
And while `Option` works nicely in the `Person` case class (and also in tuples),
`Option` does not work inside a simple `Seq`:
```
Seq(Seq(Some(1), None)).toDF.withColumn("value",
  udf { s: Seq[Option[Int]] => s.map(_.map(_ + 1)) }.apply(col("value")))
```
This inconsistency will be hard for users to understand.
Finally, let me give some background on why I am a little nervous about this
change. Spark UDFs have been somewhat limited for a long time: no support for case
classes, tuples, or options. Many libraries have worked around that by defining
their own UDFs on top of `SparkUserDefinedFunction`; we do this in-house too. It
is easy to do this with type classes thanks to the composability of
`inputSchemas`.
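The composability argument can be sketched with a minimal, Spark-free type class. Everything here (`SchemaFor`, `SchemaLike`, and the instances) is a hypothetical stand-in for the kind of library built on top of `inputSchemas`, not a Spark API; the point is only that an instance for `Seq[T]` or a tuple is derived mechanically from the instances of its components, which is exactly what a sequence of per-argument schemas allows and an opaque `ExpressionEncoder` does not:

```scala
// Hypothetical sketch of the type-class pattern described above.
// None of these names are Spark APIs.
sealed trait SchemaLike
case object IntSchema extends SchemaLike
case object StringSchema extends SchemaLike
final case class ArraySchema(element: SchemaLike) extends SchemaLike
final case class StructSchema(fields: Seq[(String, SchemaLike)]) extends SchemaLike

trait SchemaFor[T] { def schema: SchemaLike }

object SchemaFor {
  def apply[T](implicit s: SchemaFor[T]): SchemaFor[T] = s

  implicit val intSchema: SchemaFor[Int] =
    new SchemaFor[Int] { def schema = IntSchema }
  implicit val stringSchema: SchemaFor[String] =
    new SchemaFor[String] { def schema = StringSchema }

  // Instances compose: the Seq[T] instance reuses the instance for T.
  implicit def seqSchema[T](implicit t: SchemaFor[T]): SchemaFor[Seq[T]] =
    new SchemaFor[Seq[T]] { def schema = ArraySchema(t.schema) }

  // A tuple instance is built from its components' instances.
  implicit def tuple2Schema[A, B](
      implicit a: SchemaFor[A], b: SchemaFor[B]): SchemaFor[(A, B)] =
    new SchemaFor[(A, B)] {
      def schema = StructSchema(Seq("_1" -> a.schema, "_2" -> b.schema))
    }
}

// Derived automatically from the pieces, no reflection needed:
println(SchemaFor[Seq[(Int, String)]].schema)
// ArraySchema(StructSchema(List((_1,IntSchema), (_2,StringSchema))))
```

A `TypeTag`-based `ExpressionEncoder` offers no analogous way to assemble the encoder for a compound type from user-supplied encoders of its parts.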
So now `inputSchemas` has been replaced with `inputEncoders`, but `ExpressionEncoder`
and `TypeTag`s are not composable, and I do not see a way for us to build on top of
this for our own in-house UDFs. The alternative for us is to abandon our
in-house UDFs and start using Spark's UDFs again, which now support case classes
and tuples, which is nice! But the inconsistency of the API and the lack of support
for `Option` make that currently not viable for me. I realize this is a Spark-internal
API and this is entirely my own problem, but I thought it was worth
pointing out because I suspect I am not the only one who has done this. I
think this is one of the more typical workarounds people have applied to Spark
(and I am aware of multiple implementations of this workaround).
Sorry for the long post(s).