HyukjinKwon commented on a change in pull request #25130: [SPARK-28359][SQL][PYTHON][TESTS] Make integrated UDF tests robust by making UDFs (virtually) no-op
URL: https://github.com/apache/spark/pull/25130#discussion_r302928358
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala
##########
@@ -35,8 +36,9 @@ import org.apache.spark.sql.types.StringType
  * This object targets to integrate various UDF test cases so that Scalar UDF, Python UDF and
  * Scalar Pandas UDFs can be tested in SBT & Maven tests.
  *
- * The available UDFs cast input to strings, which take one column as input and return a string
- * type column as output.
+ * The available UDFs are special. It defines an UDF wrapped by cast. So, Input column is casted
+ * into string, UDF returns strings as are, and then output column is casted back to the input
+ * column. In this way, UDF is virtually no-op.

Review comment:
I think it's virtually identical before and after now, meaning `select(a)` and `select(udf(a))` will be almost the same.

To clarify, complex types such as struct, array and map types cannot be round-tripped through string conversion, so for tests involving those types let's use a workaround. Most other types can be round-tripped. This lets us avoid ugly workarounds such as `CAST` or `upper` for this case.

Another thing to note is that, since we need to refer to the input type to cast the output back when we initially create the expressions, resolved expressions are required. Because of this, I had to add one restriction when it's used in the Scala API (so it's unrelated to adding SQL tests at SPARK-27921). See https://github.com/apache/spark/pull/25130/files#diff-893577587405a826f9c454b675073f75R983 .

For now, the input columns should always be resolved via `df.col(...)` or `df(...)` when Python or Pandas UDFs are used in Scala APIs through `IntegratedUDFTestUtils`.
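To make the cast-wrapping idea concrete, here is a minimal sketch using a plain Scala UDF. It is not the code in `IntegratedUDFTestUtils`: the object `NoOpUdfSketch` and the helper `noOpUdf` are hypothetical names, and the sketch looks up the input type from the DataFrame schema rather than from a resolved expression, which is an assumption made to keep it short.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.StringType

// Hypothetical illustration only; not the actual IntegratedUDFTestUtils code.
object NoOpUdfSketch {
  // Identity UDF over strings: returns its input unchanged.
  private val identityStringUdf = udf((s: String) => s)

  // Wraps the UDF in casts: input -> string -> UDF -> back to the input type.
  // The DataFrame is needed so the original column type can be looked up.
  def noOpUdf(df: DataFrame, colName: String): Column = {
    val inputType = df.schema(colName).dataType
    identityStringUdf(df(colName).cast(StringType)).cast(inputType)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("noop-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("a")

    // select(a) and select(udf(a)) should agree for simple types.
    val plain = df.select(df("a")).collect().toSeq
    val wrapped = df.select(noOpUdf(df, "a").as("a")).collect().toSeq
    assert(plain == wrapped)

    spark.stop()
  }
}
```

As noted in the comment, this only holds for types that survive a string round trip; struct, array and map columns would still need a separate workaround.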
