[GitHub] [spark] peter-toth commented on a change in pull request #31955: [SPARK-34829][SQL] Fix typed ScalaUDF result conversion

GitBox Thu, 25 Mar 2021 01:36:01 -0700


peter-toth commented on a change in pull request #31955:
URL: https://github.com/apache/spark/pull/31955#discussion_r601199084




##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
##########
@@ -124,10 +124,10 @@ case class ScalaUDF(
     val toRow = enc.createSerializer().asInstanceOf[Any => Any]
     if (enc.isSerializedAsStructForTopLevel) {
       value: Any =>
-        if (value == null) null else toRow(value).asInstanceOf[InternalRow]
+        if (value == null) null else toRow(value).asInstanceOf[InternalRow].copy()
     } else {
       value: Any =>
-        if (value == null) null else toRow(value).asInstanceOf[InternalRow].get(0, dataType)
+        if (value == null) null else toRow(value).asInstanceOf[InternalRow].copy().get(0, dataType)

Review comment:
       The source of the issue is that `ScalaUDF` is called multiple times from `ArrayTransform` (or any other higher-order function), and each call goes through the same `resultConverter`, which uses the same expression encoder to serialize the UDF's result into an `InternalRow`
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L124).
Encoders are allowed to return the same instance of `InternalRow` on every call
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L382-L389),
so the caller needs to make sure the row is copied if needed. I think https://github.com/apache/spark/pull/28979 simply forgot to add `.copy()` after these invocations.
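
To make the failure mode concrete, here is a minimal sketch that does not use Spark at all. `MutableRow`, `ReusingSerializer` and `RowReuseDemo` are hypothetical names invented for this illustration: they only model a serializer that hands back one shared, mutable row, as the encoder is allowed to do. They are not Spark's `InternalRow` or `ExpressionEncoder`.

```scala
// Simplified model, NOT Spark's classes: MutableRow stands in for the
// row instance that a serializer is allowed to reuse across calls.
final class MutableRow(var value: String) {
  def copy(): MutableRow = new MutableRow(value)
}

// Hypothetical serializer that writes every result into one shared row
// and returns that same instance each time.
final class ReusingSerializer {
  private val buffer = new MutableRow(null)
  def apply(s: String): MutableRow = {
    buffer.value = s
    buffer
  }
}

object RowReuseDemo {
  // Applies a "UDF" (string reversal) to each element, keeps the result
  // rows, and reads their values only after all elements were processed.
  def transform(input: List[String], copyRows: Boolean): List[String] = {
    val toRow = new ReusingSerializer
    val rows = input.map { s =>
      val row = toRow(s.reverse)
      // Without the copy, every kept reference points at the shared row.
      if (copyRows) row.copy() else row
    }
    rows.map(_.value)
  }
}
```

In this model, `RowReuseDemo.transform(List("abc", "def"), copyRows = false)` yields `List("fed", "fed")` because both kept references point at the shared row, while `copyRows = true` yields `List("cba", "fed")`.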

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -2755,6 +2755,20 @@ class DataFrameSuite extends QueryTest
     )
     checkAnswer(test.select($"best_name.name"), Row("bob") :: Row("bob") :: Row("sam") :: Nil)
   }
+
+  test("SPARK-34829: Typed ScalaUDF result conversion works") {
+    val reverse = udf((s: String) => s.reverse)
+    val df = Seq(Array("abc", "def")).toDF("array")
+    val test = df.select(transform(col("array"), s => reverse(s)))
+    checkAnswer(test, Row(Array("cba", "fed")) :: Nil)

Review comment:
       I don't think there is a codegen path for `ArrayTransform`. It can only call `ScalaUDF.eval()`, on its interpreted path
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L282).






